Abstract: This paper surveys various results about Markov chains on general
(non-countable) state spaces. It begins with an introduction to Markov
chain Monte Carlo (MCMC) algorithms, which provide the motivation and 
context for the theory which follows. Then, sufficient conditions for geomet- 
ric and uniform ergodicity are presented, along with quantitative bounds 
on the rate of convergence to stationarity. Many of these results are proved 
using direct coupling constructions based on minorisation and drift con- 
ditions. Necessary and sufficient conditions for Central Limit Theorems 
(CLTs) are also presented, in some cases proved via the Poisson Equa- 
tion or direct regeneration constructions. Finally, optimal scaling and weak 
convergence results for Metropolis-Hastings algorithms are discussed. None 
of the results presented is new, though many of the proofs are. We also 
describe some Open Problems.

1. Introduction

Markov chain Monte Carlo (MCMC) algorithms, such as the Metropolis-Hastings
algorithm ([53], [37]) and the Gibbs sampler (e.g. Geman and Geman [32];
Gelfand and Smith [30]), have become extremely popular in statistics,
as a way of approximately sampling from complicated probability distributions
in high dimensions (see for example the reviews [93], [89], [33], [71]). Most
dramatically, the existence of MCMC algorithms has transformed Bayesian in- 
ference, by allowing practitioners to sample from posterior distributions of com- 
plicated statistical models. 

In addition to their importance to applications in statistics and other sub- 
jects, these algorithms also raise numerous questions related to probability the- 
ory and the mathematics of Markov chains. In particular, MCMC algorithms 
involve Markov chains $\{X_n\}$ having a (complicated) stationary distribution $\pi(\cdot)$,
for which it is important to understand as precisely as possible the nature and
speed of the convergence of the law of $X_n$ to $\pi(\cdot)$ as $n$ increases.

This is an original survey paper.

This paper attempts to explain and summarise MCMC algorithms and the
probability theory questions that they generate. After introducing the
algorithms (Section 2), we discuss various important theoretical questions related
to them. In Section 3 we present various convergence rate results commonly
used in MCMC. Most of these are proved in Section 4, using direct coupling
arguments and thereby avoiding many of the analytic technicalities of previous
proofs. We consider MCMC central limit theorems in Section 5, and optimal
scaling and weak convergence results in Section 6. Numerous references to the
MCMC literature are given throughout. We also describe some Open Problems.

1.1. The problem 

The problem addressed by MCMC algorithms is the following. We are given a
density function $\pi_u$ on some state space $\mathcal{X}$, which is possibly unnormalised but
at least satisfies $0 < \int_{\mathcal{X}} \pi_u < \infty$. (Typically $\mathcal{X}$ is an open subset of $\mathbf{R}^d$, and the
densities are taken with respect to Lebesgue measure, though other settings,
including discrete state spaces, are also possible.) This density gives rise to a
probability measure $\pi(\cdot)$ on $\mathcal{X}$, by

$$\pi(A) = \frac{\int_A \pi_u(x)\,dx}{\int_{\mathcal{X}} \pi_u(x)\,dx}. \qquad (1)$$

We want to (say) estimate expectations of functions $f : \mathcal{X} \to \mathbf{R}$ with respect
to $\pi(\cdot)$, i.e. we want to estimate

$$\pi(f) = \mathbf{E}_\pi[f(X)] = \frac{\int_{\mathcal{X}} f(x)\,\pi_u(x)\,dx}{\int_{\mathcal{X}} \pi_u(x)\,dx}. \qquad (2)$$

If $\mathcal{X}$ is high-dimensional, and $\pi_u$ is a complicated function, then direct integration
(either analytic or numerical) of the integrals in (2) is infeasible.

The classical Monte Carlo solution to this problem is to simulate i.i.d. random
variables $Z_1, Z_2, \ldots, Z_N \sim \pi(\cdot)$, and then estimate $\pi(f)$ by

$$\hat\pi(f) = (1/N) \sum_{i=1}^{N} f(Z_i). \qquad (3)$$

This gives an unbiased estimate, having standard deviation of order $O(1/\sqrt{N})$.
Furthermore, if $\pi(f^2) < \infty$, then by the classical Central Limit Theorem, the
error $\hat\pi(f) - \pi(f)$ will have a limiting normal distribution, which is also useful.
The problem, however, is that if $\pi_u$ is complicated, then it is very difficult to
directly simulate i.i.d. random variables from $\pi(\cdot)$.
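
As a small illustration of the estimator (3), the following sketch treats a case where i.i.d. sampling happens to be easy: the target is taken to be $N(0,1)$ and $f(x) = x^2$, so that $\pi(f) = 1$. The function name `monte_carlo_estimate` and the choice of target are assumptions made purely for illustration.

```python
import numpy as np

def monte_carlo_estimate(f, sampler, n_samples, rng):
    """Average f over n_samples i.i.d. draws produced by sampler(rng, n)."""
    z = sampler(rng, n_samples)
    return float(np.mean(f(z)))

rng = np.random.default_rng(0)

# Illustrative target: pi = N(0, 1), which can be sampled directly.
def draw_normal(rng, n):
    return rng.standard_normal(n)

# Estimate pi(f) for f(x) = x^2; the exact value is 1.
estimate = monte_carlo_estimate(lambda x: x**2, draw_normal, 100_000, rng)
```

With $N = 10^5$ draws the standard deviation of this estimate is about $\sqrt{2/N} \approx 0.0045$, in line with the $O(1/\sqrt{N})$ rate.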

The Markov chain Monte Carlo (MCMC) solution is to instead construct a
Markov chain on $\mathcal{X}$ which is easily run on a computer, and which has $\pi(\cdot)$ as
a stationary distribution. That is, we want to define easily-simulated Markov
chain transition probabilities $P(x, dy)$ for $x, y \in \mathcal{X}$, such that

$$\int_{x \in \mathcal{X}} \pi(dx)\, P(x, dy) = \pi(dy). \qquad (4)$$

Then hopefully (see Subsection 3.2), if we run the Markov chain for a long
time (started from anywhere), then for large $n$ the distribution of $X_n$ will be
approximately stationary: $\mathcal{L}(X_n) \approx \pi(\cdot)$. We can then (say) set $Z_1 = X_n$, and
then restart and rerun the Markov chain to obtain $Z_2$, $Z_3$, etc., and then do
estimates as in (3).

It may seem at first to be even more difficult to find such a Markov chain,
than to estimate $\pi(f)$ directly. However, we shall see in the next section that
constructing (and running) such Markov chains is often surprisingly straightfor- 
ward. 

Remark. In the practical use of MCMC, rather than start a fresh Markov
chain for each new sample, often an entire tail of the Markov chain run $\{X_n\}$ is
used to create an estimate such as $(N - B)^{-1} \sum_{i=B+1}^{N} f(X_i)$, where the burn-in
value $B$ is hopefully chosen large enough that $\mathcal{L}(X_B) \approx \pi(\cdot)$. In that case the
different $f(X_i)$ are not independent, but the estimate can be computed more
efficiently. Since many of the mathematical issues which arise are similar in 
either implementation, we largely ignore this modification herein. 

Remark. MCMC is, of course, not the only way to sample or estimate from 
complicated probability distributions. Other possible sampling algorithms
include "rejection sampling" and "importance sampling", not reviewed here; but
these alternative algorithms only work well in certain particular cases and are 
not as widely applicable as MCMC algorithms. 

1.2. Motivation: Bayesian Statistics Computations 

While MCMC algorithms are used in many fields (statistical physics, computer 
science), their most widespread application is in Bayesian statistical inference. 

Let $L(y \mid \theta)$ be the likelihood function (i.e., density of data $y$ given unknown
parameters $\theta$) of a statistical model, for $\theta \in \mathcal{X}$. (Usually $\mathcal{X} \subseteq \mathbf{R}^d$.) Let the
"prior" density of $\theta$ be $p(\theta)$. Then the "posterior" distribution of $\theta$ given $y$ is
the density which is proportional to

$$\pi_u(\theta) = L(y \mid \theta)\, p(\theta).$$

(Of course, the normalisation constant is simply the density for the data $y$,
though that constant may be impossible to compute.) The "posterior mean" of
any functional $f$ is then given by:

$$\pi(f) = \frac{\int_{\mathcal{X}} f(x)\,\pi_u(x)\,dx}{\int_{\mathcal{X}} \pi_u(x)\,dx}.$$

For this reason, Bayesians are anxious (even desperate!) to estimate such
$\pi(f)$. Good estimates allow Bayesian inference to be used to estimate a wide
variety of parameters, probabilities, means, etc. MCMC has proven to be
extremely helpful for such Bayesian estimates, and MCMC is now extremely widely
used in the Bayesian statistical community.

2. Constructing MCMC Algorithms

We see from the above that an MCMC algorithm requires, given a probability
distribution $\pi(\cdot)$ on a state space $\mathcal{X}$, a Markov chain on $\mathcal{X}$ which is easily
run on a computer, and which has $\pi(\cdot)$ as its stationary distribution as in (4).
This section explains how such Markov chains are constructed. It thus provides 
motivation and context for the theory which follows; however, for the reader 
interested purely in the mathematical results, this section can be omitted with 
little loss of continuity. 

A key notion is reversibility, as follows. 

Definition. A Markov chain on a state space $\mathcal{X}$ is reversible with respect to a
probability distribution $\pi(\cdot)$ on $\mathcal{X}$, if

$$\pi(dx)\, P(x, dy) = \pi(dy)\, P(y, dx), \qquad x, y \in \mathcal{X}.$$

A very important property of reversibility is the following.

Proposition 1. If a Markov chain is reversible with respect to $\pi(\cdot)$, then $\pi(\cdot)$ is
stationary for the chain.

Proof. We compute that

$$\int_{x \in \mathcal{X}} \pi(dx)\, P(x, dy) = \int_{x \in \mathcal{X}} \pi(dy)\, P(y, dx) = \pi(dy) \int_{x \in \mathcal{X}} P(y, dx) = \pi(dy).$$

□ 

We see from this proposition that, when constructing an MCMC algorithm, it
suffices to create a Markov chain which is easily run, and which is reversible
with respect to $\pi(\cdot)$. The simplest way to do so is to use the Metropolis-Hastings
algorithm, as we now discuss.



2.1. The Metropolis-Hastings Algorithm 

Suppose again that $\pi(\cdot)$ has a (possibly unnormalised) density $\pi_u$, as in (1).
Let $Q(x, \cdot)$ be essentially any other Markov chain, whose transitions also have
a (possibly unnormalised) density, i.e. $Q(x, dy) \propto q(x, y)\,dy$.

The Metropolis-Hastings algorithm proceeds as follows. First choose some $X_0$.
Then, given $X_n$, generate a proposal $Y_{n+1}$ from $Q(X_n, \cdot)$. Also flip an independent
coin, whose probability of heads equals $\alpha(X_n, Y_{n+1})$, where

$$\alpha(x, y) = \min\left[1, \frac{\pi_u(y)\, q(y, x)}{\pi_u(x)\, q(x, y)}\right].$$

(To avoid ambiguity, we set $\alpha(x, y) = 1$ whenever $\pi_u(x)\, q(x, y) = 0$.) Then, if
the coin is heads, "accept" the proposal by setting $X_{n+1} = Y_{n+1}$; if the coin is
tails then "reject" the proposal by setting $X_{n+1} = X_n$. Replace $n$ by $n + 1$ and
repeat.
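
The steps above can be sketched in a few lines of code. The following is only a minimal illustration for a one-dimensional state space, working with logarithms for numerical stability; the function `metropolis_hastings` and its arguments are hypothetical names, and the example target (an unnormalised Exp(1) density on $(0, \infty)$) and proposal (a multiplicative log-normal random walk, which is not symmetric, so the $q$ ratio matters) are assumptions chosen for illustration.

```python
import numpy as np

def metropolis_hastings(log_pi_u, propose, log_q, x0, n_steps, rng):
    """Run n_steps of Metropolis-Hastings and return the visited states.

    log_pi_u(x):     log of the (possibly unnormalised) target density pi_u.
    propose(x, rng): draws a proposal Y from Q(x, .).
    log_q(x, y):     log of the proposal density q(x, y), up to a constant.
    """
    x = x0
    chain = np.empty(n_steps)
    for n in range(n_steps):
        y = propose(x, rng)
        # log of pi_u(y) q(y, x) / (pi_u(x) q(x, y))
        log_ratio = (log_pi_u(y) + log_q(y, x)) - (log_pi_u(x) + log_q(x, y))
        # "Coin flip": accept with probability min(1, ratio).
        if np.log(rng.uniform()) < log_ratio:
            x = y
        chain[n] = x
    return chain

# Illustrative target: pi_u(x) = exp(-x) on (0, infinity), whose mean is 1.
sigma = 1.0
log_pi_u = lambda x: -x if x > 0 else -np.inf
# Proposal Y = x * exp(sigma * Z) with Z ~ N(0, 1); q(x, y) is a log-normal density in y.
propose = lambda x, rng: x * np.exp(sigma * rng.standard_normal())
log_q = lambda x, y: -np.log(y) - (np.log(y) - np.log(x)) ** 2 / (2 * sigma**2)

rng = np.random.default_rng(1)
chain = metropolis_hastings(log_pi_u, propose, log_q, x0=1.0, n_steps=50_000, rng=rng)
```

Only differences of `log_pi_u` and `log_q` enter the acceptance step, so any normalising constants cancel, as in the ratio defining $\alpha(x, y)$.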

The reason for the unusual formula for $\alpha(x, y)$ is the following:

Proposition 2. The Metropolis-Hastings algorithm (as described above) produces
a Markov chain $\{X_n\}$ which is reversible with respect to $\pi(\cdot)$.

Proof. We need to show

$$\pi(dx)\, P(x, dy) = \pi(dy)\, P(y, dx).$$

It suffices to assume $x \neq y$ (since if $x = y$ then the equation is trivial). But for
$x \neq y$, setting $c = \int_{\mathcal{X}} \pi_u(x)\,dx$,

$$\begin{aligned}
\pi(dx)\, P(x, dy) &= [c^{-1} \pi_u(x)\,dx]\, [q(x, y)\, \alpha(x, y)\,dy] \\
&= c^{-1}\, \pi_u(x)\, q(x, y) \min\left[1, \frac{\pi_u(y)\, q(y, x)}{\pi_u(x)\, q(x, y)}\right] dx\,dy \\
&= c^{-1} \min[\pi_u(x)\, q(x, y),\ \pi_u(y)\, q(y, x)]\,dx\,dy,
\end{aligned}$$

which is symmetric in $x$ and $y$. □

To run the Metropolis-Hastings algorithm on a computer, we just need to
be able to run the proposal chain $Q(x, \cdot)$ (which is easy, for appropriate choices
of $Q$), and then do the accept/reject step (which is easy, provided we can easily
compute the densities at individual points). Thus, running the algorithm is
quite feasible. Furthermore we need to compute only ratios of densities [e.g.
$\pi_u(y) / \pi_u(x)$], so we don't require the normalising constant $c = \int_{\mathcal{X}} \pi_u(x)\,dx$.

However, this algorithm in turn suggests further questions. Most obviously,
how should we choose the proposal distributions $Q(x, \cdot)$? In addition, once
$Q(x, \cdot)$ is chosen, then will we really have $\mathcal{L}(X_n) \approx \pi(\cdot)$ for large enough $n$?
How large is large enough? We will return to these questions below.

Regarding the first question, there are many different classes of ways of 
choosing the proposal density, such as: 

• Symmetric Metropolis Algorithm. Here $q(x, y) = q(y, x)$, and the
acceptance probability simplifies to

$$\alpha(x, y) = \min\left[1, \frac{\pi_u(y)}{\pi_u(x)}\right].$$

• Random walk Metropolis-Hastings. Here $q(x, y) = q(y - x)$. For
example, perhaps $Q(x, \cdot) = N(x, \sigma^2)$, or $Q(x, \cdot) = \mathrm{Uniform}(x - 1, x + 1)$.

• Independence sampler. Here $q(x, y) = q(y)$, i.e. $Q(x, \cdot)$ does not depend
on $x$.

• Langevin algorithm. Here the proposal is generated by

$$Y_{n+1} \sim N\big(X_n + (\delta/2)\, \nabla \log \pi(X_n),\ \delta\big),$$

for some (small) $\delta > 0$. (This is motivated by a discrete approximation to a
Langevin diffusion process.)

More about optimal choices of proposal distributions will be discussed in a
later section, as will the second question about time to stationarity (i.e. how
large does $n$ need to be).

2.2. Combining Chains

If $P_1$ and $P_2$ are two different chains, each having stationary distribution $\pi(\cdot)$,
then the new chain $P_1 P_2$ also has stationary distribution $\pi(\cdot)$.

Thus, it is perfectly acceptable, and quite common (see e.g. Tierney [93] and
[59]), to make new MCMC algorithms out of old ones, by specifying that the
new algorithm applies first the chain $P_1$, then the chain $P_2$, then the chain $P_1$
again, etc. (And, more generally, it is possible to combine many different chains
in this manner.)

Note that, even if each of $P_1$ and $P_2$ is reversible, the combined chain $P_1 P_2$
will in general not be reversible. It is for this reason that it is important, when
studying MCMC, to allow for non-reversible chains as well.
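
Both claims can be checked numerically on a small finite state space. In the sketch below, the two $3 \times 3$ kernels are hypothetical examples chosen for illustration: each is symmetric, hence reversible with respect to the uniform distribution, but they do not commute, so their product is stationary without being reversible.

```python
import numpy as np

pi = np.array([1/3, 1/3, 1/3])

# P1 mixes states 1 and 2; P2 mixes states 2 and 3 (indices 0, 1, 2 in code).
P1 = np.array([[0.5, 0.5, 0.0],
               [0.5, 0.5, 0.0],
               [0.0, 0.0, 1.0]])
P2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.5, 0.5]])

def is_stationary(pi, P):
    """Check the discrete form of (4): pi P = pi."""
    return np.allclose(pi @ P, pi)

def is_reversible(pi, P):
    """Check pi(x) P(x, y) = pi(y) P(y, x) for all x, y."""
    flow = pi[:, None] * P          # flow[x, y] = pi(x) P(x, y)
    return np.allclose(flow, flow.T)

P12 = P1 @ P2   # apply P1 first, then P2
```

Here `P12` moves from state 1 to state 3 with probability 1/4 but from state 3 to state 1 with probability 0, which is why reversibility fails.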



2.3. The Gibbs Sampler 

The Gibbs sampler is also known as the "heat bath" algorithm, or as "Glauber
dynamics". Suppose again that $\pi_u(\cdot)$ is a $d$-dimensional density, with $\mathcal{X}$ an open
subset of $\mathbf{R}^d$, and write $x = (x_1, \ldots, x_d)$.

The $i$-th component Gibbs sampler $P_i$ is defined such that $P_i$ leaves all
components besides $i$ unchanged, and replaces the $i$-th component by a draw from
the full conditional distribution of $\pi(\cdot)$ conditional on all the other components.

More formally, let

$$S_{x,i,a,b} = \{y \in \mathcal{X};\ y_j = x_j \text{ for } j \neq i, \text{ and } a \le y_i \le b\}.$$

Then

$$P_i(x, S_{x,i,a,b}) = \frac{\int_a^b \pi_u(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_d)\,dt}{\int_{-\infty}^{\infty} \pi_u(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_d)\,dt}, \qquad a \le b.$$

It follows immediately (from direct computation, or from the definition of
conditional density), that $P_i$ is reversible with respect to $\pi(\cdot)$. (In fact, $P_i$
may be regarded as a special case of a Metropolis-Hastings algorithm, with
$\alpha(x, y) \equiv 1$.) Hence, $P_i$ has $\pi(\cdot)$ as a stationary distribution.

We then construct the full Gibbs sampler out of the various $P_i$, by combining
them (as in the previous subsection) in one of two ways:

• The deterministic-scan Gibbs sampler is

$$P = P_1 P_2 \cdots P_d.$$

That is, it performs the $d$ different Gibbs sampler components, in sequential
order.

• The random-scan Gibbs sampler is

$$P = \frac{1}{d} \sum_{i=1}^{d} P_i.$$

That is, it does one of the $d$ different Gibbs sampler components, chosen
uniformly at random.

Either version produces an MCMC algorithm having $\pi(\cdot)$ as its stationary
distribution. The output of a Gibbs sampler is thus a "zig-zag pattern", where
the components get updated one at a time. (Also, the random-scan Gibbs
sampler is reversible, while the deterministic-scan Gibbs sampler usually is not.)
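
To make the two scans concrete, here is a minimal sketch for a toy target that is not taken from the text: a standard bivariate normal with correlation $\rho$, whose full conditionals are $x_i \mid x_j \sim N(\rho x_j, 1 - \rho^2)$. The function name `gibbs_bivariate_normal` and the `scan` argument are hypothetical.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_steps, rng, scan="deterministic"):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Full conditionals: x_i | x_j ~ N(rho * x_j, 1 - rho**2).
    """
    x = np.zeros(2)
    sd = np.sqrt(1.0 - rho**2)
    chain = np.empty((n_steps, 2))
    for n in range(n_steps):
        if scan == "deterministic":
            coords = (0, 1)                  # P = P_1 P_2: update both, in order
        else:
            coords = (rng.integers(2),)      # P = (1/2)(P_1 + P_2): one at random
        for i in coords:
            # Replace coordinate i by a draw from its full conditional.
            x[i] = rho * x[1 - i] + sd * rng.standard_normal()
        chain[n] = x
    return chain

rng = np.random.default_rng(3)
det_chain = gibbs_bivariate_normal(0.8, 40_000, rng, scan="deterministic")
rnd_chain = gibbs_bivariate_normal(0.8, 40_000, rng, scan="random")
```

Each iteration changes one coordinate at a time, which is the "zig-zag pattern" described above; both chains should reproduce the target correlation of 0.8 in the long run.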

2.4. Detailed Bayesian Example: Variance Components Model

We close this section by presenting a typical example of a target density $\pi_u$ that
arises in Bayesian statistics, in an effort to illustrate the problems and issues 
which arise. 

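The model involves fixed constant $\mu_0$ and positive constants $a_1$, $b_1$, $a_2$, $b_2$, and
$\sigma_0^2$. It involves three hyperparameters, $\sigma_\theta^2$, $\sigma_e^2$, and $\mu$, each having priors based
upon these constants as follows: $\sigma_\theta^2 \sim IG(a_1, b_1)$; $\sigma_e^2 \sim IG(a_2, b_2)$; and
$\mu \sim N(\mu_0, \sigma_0^2)$. It involves $K$ further parameters $\theta_1, \theta_2, \ldots, \theta_K$, conditionally
independent given the above hyperparameters, with $\theta_i \sim N(\mu, \sigma_\theta^2)$. In terms of
these parameters, the data $\{Y_{ij}\}$ ($1 \le i \le K$, $1 \le j \le J$) are assumed to be
distributed as $Y_{ij} \sim N(\theta_i, \sigma_e^2)$, conditionally independently given the parameters.
A graphical representation of the model is as follows:

[Figure: graphical representation of the model, ending in the data nodes
$Y_{11}, \ldots, Y_{1J}, \ \ldots, \ Y_{K1}, \ldots, Y_{KJ}$ with $Y_{ij} \sim N(\theta_i, \sigma_e^2)$.]

The Bayesian paradigm then involves conditioning on the values of the data
$\{Y_{ij}\}$, and considering the joint distribution of all $K + 3$ parameters given this
data. That is, we are interested in the distribution

$$\pi(\cdot) = \mathcal{L}\big(\sigma_\theta^2, \sigma_e^2, \mu, \theta_1, \ldots, \theta_K \mid \{Y_{ij}\}\big),$$

defined on the state space $\mathcal{X} = (0, \infty)^2 \times \mathbf{R}^{K+1}$. We would like to sample from
this distribution $\pi(\cdot)$. We compute that this distribution's unnormalised density
is given by

$$\begin{aligned}
\pi_u(\sigma_\theta^2, \sigma_e^2, \mu, \theta_1, \ldots, \theta_K) \ \propto\ & e^{-b_1/\sigma_\theta^2}\, (\sigma_\theta^2)^{-a_1-1}\, e^{-b_2/\sigma_e^2}\, (\sigma_e^2)^{-a_2-1}\, e^{-(\mu - \mu_0)^2 / 2\sigma_0^2} \\
& \times \prod_{i=1}^{K} \Big[ e^{-(\theta_i - \mu)^2 / 2\sigma_\theta^2} / \sigma_\theta \Big] \times \prod_{i=1}^{K} \prod_{j=1}^{J} \Big[ e^{-(Y_{ij} - \theta_i)^2 / 2\sigma_e^2} / \sigma_e \Big].
\end{aligned}$$

This is a very typical target density for MCMC in statistics, in that it is
high-dimensional ($K + 3$), its formula is messy and irregular, it is positive throughout
$\mathcal{X}$, and it is larger in the "center" of $\mathcal{X}$ and smaller in the "tails" of $\mathcal{X}$.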

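We now consider constructing MCMC algorithms to sample from the target
density $\pi_u$. We begin with the Gibbs sampler. To run a Gibbs sampler, we
require the full conditional distributions, computed (without difficulty since
they are all one-dimensional) to be as follows:

$$\mathcal{L}(\sigma_\theta^2 \mid \mu, \sigma_e^2, \theta_1, \ldots, \theta_K, Y_{ij}) = IG\Big(a_1 + \tfrac{1}{2}K,\ b_1 + \tfrac{1}{2}\sum_{i} (\theta_i - \mu)^2\Big);$$

$$\mathcal{L}(\sigma_e^2 \mid \mu, \sigma_\theta^2, \theta_1, \ldots, \theta_K, Y_{ij}) = IG\Big(a_2 + \tfrac{1}{2}KJ,\ b_2 + \tfrac{1}{2}\sum_{i,j} (Y_{ij} - \theta_i)^2\Big);$$

$$\mathcal{L}(\mu \mid \sigma_\theta^2, \sigma_e^2, \theta_1, \ldots, \theta_K, Y_{ij}) = N\Big(\frac{\sigma_\theta^2 \mu_0 + \sigma_0^2 \sum_i \theta_i}{\sigma_\theta^2 + K\sigma_0^2},\ \frac{\sigma_\theta^2 \sigma_0^2}{\sigma_\theta^2 + K\sigma_0^2}\Big);$$

$$\mathcal{L}(\theta_i \mid \mu, \sigma_\theta^2, \sigma_e^2, \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_K, Y_{ij}) = N\Big(\frac{J\sigma_\theta^2 \bar{Y}_i + \sigma_e^2 \mu}{J\sigma_\theta^2 + \sigma_e^2},\ \frac{\sigma_\theta^2 \sigma_e^2}{J\sigma_\theta^2 + \sigma_e^2}\Big);$$

where $\bar{Y}_i = \frac{1}{J}\sum_{j=1}^{J} Y_{ij}$, and the last equation holds for $1 \le i \le K$. The
Gibbs sampler then proceeds by updating the $K + 3$ variables, in turn (either
deterministic or random scan), according to the above conditional distributions.
This is feasible since the conditional distributions are all easily simulated (IG
and N). In fact, it appears to work well, both in practice and according to various
theoretical results; this model was one of the early statistical applications of the
Gibbs sampler by Gelfand and Smith [30], and versions of it have been used and
studied often (see e.g. [79], [57], [82], [20], [44], [45]).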
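
A deterministic-scan Gibbs sampler for this model can be sketched as follows. This is only an illustration: it assumes the standard conjugate full conditionals for the variance components model (inverse gamma for the two variances, normal for $\mu$ and each $\theta_i$), draws $IG(a, b)$ variates as reciprocals of Gamma variates with shape $a$ and rate $b$, and is run on synthetic data with arbitrarily chosen hyperparameter values. The function name and argument names are hypothetical.

```python
import numpy as np

def gibbs_variance_components(Y, a1, b1, a2, b2, mu0, s0sq, n_steps, rng):
    """Deterministic-scan Gibbs sampler for the variance components model.

    Returns an array with rows (sigma_theta^2, sigma_e^2, mu, theta_1..theta_K).
    """
    K, J = Y.shape
    ybar = Y.mean(axis=1)
    sth, se, mu, theta = 1.0, 1.0, ybar.mean(), ybar.copy()   # arbitrary start
    out = np.empty((n_steps, K + 3))
    for n in range(n_steps):
        # IG(a, b) draw = 1 / Gamma(shape=a, scale=1/b).
        sth = 1.0 / rng.gamma(a1 + K / 2,
                              1.0 / (b1 + 0.5 * np.sum((theta - mu) ** 2)))
        se = 1.0 / rng.gamma(a2 + K * J / 2,
                             1.0 / (b2 + 0.5 * np.sum((Y - theta[:, None]) ** 2)))
        # Normal full conditional of mu.
        denom = sth + K * s0sq
        mu = rng.normal((sth * mu0 + s0sq * theta.sum()) / denom,
                        np.sqrt(sth * s0sq / denom))
        # Normal full conditionals of theta_1, ..., theta_K (independent given the rest).
        denom = J * sth + se
        theta = rng.normal((J * sth * ybar + se * mu) / denom,
                           np.sqrt(sth * se / denom))
        out[n] = np.concatenate(([sth, se, mu], theta))
    return out

# Synthetic data (illustrative values only): K = 6 groups, J = 5 observations each.
rng = np.random.default_rng(2)
K, J = 6, 5
theta_true = rng.normal(2.0, 1.0, size=K)
Y = rng.normal(theta_true[:, None], 0.5, size=(K, J))
draws = gibbs_variance_components(Y, 1.0, 1.0, 1.0, 1.0, 0.0, 100.0, 5000, rng)
```

Because the $\theta_i$ are conditionally independent given the other variables, updating them in one vectorised draw is equivalent to updating them one at a time in sequence.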

Alternatively, we can run a Metropolis-Hastings algorithm for this model.
For example, we might choose a symmetric random-walk Metropolis algorithm
with proposals of the form $N(X_n, \sigma^2 I_{K+3})$ for some $\sigma^2 > 0$ (say). Then, given
$X_n$, the algorithm would proceed as follows:

1. Choose $Y_{n+1} \sim N(X_n, \sigma^2 I_{K+3})$;

2. Choose $U_{n+1} \sim \mathrm{Uniform}[0, 1]$;

3. If $U_{n+1} < \pi_u(Y_{n+1}) / \pi_u(X_n)$, then set $X_{n+1} = Y_{n+1}$ (accept). Otherwise
set $X_{n+1} = X_n$ (reject).

This MCMC algorithm also appears to work well for this model, at least if
the value of $\sigma^2$ is chosen appropriately (as discussed in Section 5). We conclude
that, for such "typical" target distributions $\pi(\cdot)$, both the Gibbs sampler and
appropriate Metropolis-Hastings algorithms perform well in practice, and allow
us to sample from $\pi(\cdot)$.

3. Bounds on Markov Chain Convergence Times

Once we know how to construct (and run) lots of different MCMC algorithms,
other questions arise. Most obviously, do they converge to the distribution $\pi(\cdot)$?
And, how quickly does this convergence take place?

To proceed, write $P^n(x, A)$ for the $n$-step transition law of the Markov chain:

$$P^n(x, A) = \mathbf{P}[X_n \in A \mid X_0 = x].$$

The main MCMC convergence questions are, is $P^n(x, A)$ "close" to $\pi(A)$ for
large enough $n$? And, how large is large enough?

3.1. Total Variation Distance 

We shall measure the distance to stationarity in terms of total variation distance,
defined as follows:

Definition. The total variation distance between two probability measures $\nu_1(\cdot)$
and $\nu_2(\cdot)$ is:

$$\|\nu_1(\cdot) - \nu_2(\cdot)\| = \sup_A |\nu_1(A) - \nu_2(A)|.$$

We can then ask, is $\lim_{n \to \infty} \|P^n(x, \cdot) - \pi(\cdot)\| = 0$? And, given $\epsilon > 0$, how
large must $n$ be so that $\|P^n(x, \cdot) - \pi(\cdot)\| < \epsilon$? We consider such questions herein.
We first pause to note some simple properties of total variation distance.

Proposition 3. (a) $\|\nu_1(\cdot) - \nu_2(\cdot)\| = \sup_{f : \mathcal{X} \to [0,1]} \left| \int f\,d\nu_1 - \int f\,d\nu_2 \right|$.

(b) $\|\nu_1(\cdot) - \nu_2(\cdot)\| = \frac{1}{b-a} \sup_{f : \mathcal{X} \to [a,b]} \left| \int f\,d\nu_1 - \int f\,d\nu_2 \right|$ for any $a < b$, and in
particular $\|\nu_1(\cdot) - \nu_2(\cdot)\| = \frac{1}{2} \sup_{f : \mathcal{X} \to [-1,1]} \left| \int f\,d\nu_1 - \int f\,d\nu_2 \right|$.

(c) If $\pi(\cdot)$ is stationary for a Markov chain kernel $P$, then $\|P^n(x, \cdot) - \pi(\cdot)\|$ is
non-increasing in $n$, i.e. $\|P^n(x, \cdot) - \pi(\cdot)\| \le \|P^{n-1}(x, \cdot) - \pi(\cdot)\|$ for $n \in \mathbf{N}$.

(d) More generally, letting $(\nu_i P)(A) = \int \nu_i(dx)\, P(x, A)$, we always have
$\|(\nu_1 P)(\cdot) - (\nu_2 P)(\cdot)\| \le \|\nu_1(\cdot) - \nu_2(\cdot)\|$.

(e) Let $t(n) = 2 \sup_{x \in \mathcal{X}} \|P^n(x, \cdot) - \pi(\cdot)\|$, where $\pi(\cdot)$ is stationary. Then $t$ is
sub-multiplicative, i.e. $t(m + n) \le t(m)\, t(n)$ for $m, n \in \mathbf{N}$.

(f) If $\mu(\cdot)$ and $\nu(\cdot)$ have densities $g$ and $h$, respectively, with respect to some
$\sigma$-finite measure $\rho(\cdot)$, and $M = \max(g, h)$ and $m = \min(g, h)$, then

$$\|\mu(\cdot) - \nu(\cdot)\| = \frac{1}{2} \int_{\mathcal{X}} (M - m)\,d\rho = 1 - \int_{\mathcal{X}} m\,d\rho.$$

(g) Given probability measures $\mu(\cdot)$ and $\nu(\cdot)$, there are jointly defined random
variables $X$ and $Y$ such that $X \sim \mu(\cdot)$, $Y \sim \nu(\cdot)$, and $\mathbf{P}[X = Y] = 1 - \|\mu(\cdot) - \nu(\cdot)\|$.

Proof. For (a), let $\rho(\cdot)$ be any $\sigma$-finite measure such that $\nu_1 \ll \rho$ and $\nu_2 \ll \rho$
(e.g. $\rho = \nu_1 + \nu_2$), and set $g = d\nu_1/d\rho$ and $h = d\nu_2/d\rho$. Then $| \int f\,d\nu_1 - \int f\,d\nu_2 | =
| \int f (g - h)\,d\rho |$. This is maximised (over all $0 \le f \le 1$) when $f = 1$ on $\{g > h\}$
and $f = 0$ on $\{h > g\}$ (or vice-versa), in which case it equals $|\nu_1(A) - \nu_2(A)|$
for $A = \{g > h\}$ (or $\{g < h\}$), thus proving the equivalence.

Part (b) follows very similarly to (a), except now $f = b$ on $\{g > h\}$ and $f = a$
on $\{g < h\}$ (or vice-versa), leading to $| \int f\,d\nu_1 - \int f\,d\nu_2 | = (b - a)\, |\nu_1(A) - \nu_2(A)|$.

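For part (c), we compute that

$$\begin{aligned}
|P^{n+1}(x, A) - \pi(A)| &= \left| \int_{y \in \mathcal{X}} P^n(x, dy)\, P(y, A) - \int_{y \in \mathcal{X}} \pi(dy)\, P(y, A) \right| \\
&= \left| \int_{y \in \mathcal{X}} P^n(x, dy)\, f(y) - \int_{y \in \mathcal{X}} \pi(dy)\, f(y) \right| \ \le\ \|P^n(x, \cdot) - \pi(\cdot)\|,
\end{aligned}$$

where $f(y) = P(y, A)$, and where the inequality comes from part (a).
Part (d) follows very similarly to part (c).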

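Part (e) follows since $t(n)$ is an $L^\infty$ operator norm of $P^n$ (cf. Meyn and
Tweedie [54], Lemma 16.1.1). More specifically, let $\hat{P}(x, \cdot) = P^n(x, \cdot) - \pi(\cdot)$ and
$\hat{Q}(x, \cdot) = P^m(x, \cdot) - \pi(\cdot)$, so that

$$\begin{aligned}
(\hat{P}\hat{Q}f)(x) &= \int_{y \in \mathcal{X}} f(y) \int_{z \in \mathcal{X}} [P^n(x, dz) - \pi(dz)]\, [P^m(z, dy) - \pi(dy)] \\
&= \int_{y \in \mathcal{X}} f(y)\, [P^{n+m}(x, dy) - \pi(dy) - \pi(dy) + \pi(dy)] \\
&= \int_{y \in \mathcal{X}} f(y)\, [P^{n+m}(x, dy) - \pi(dy)].
\end{aligned}$$

Then let $f : \mathcal{X} \to [0, 1]$, let $g(x) = (\hat{Q}f)(x) = \int_{y \in \mathcal{X}} \hat{Q}(x, dy)\, f(y)$, and let
$g^* = \sup_{x \in \mathcal{X}} |g(x)|$. Then $g^* \le \frac{1}{2} t(m)$ by part (a). Now, if $g^* = 0$, then clearly
$\hat{P}\hat{Q}f = 0$. Otherwise, we compute that

$$2 \sup_{x \in \mathcal{X}} |(\hat{P}\hat{Q}f)(x)| = 2 g^* \sup_{x \in \mathcal{X}} |(\hat{P}[g/g^*])(x)| \ \le\ t(m) \sup_{x \in \mathcal{X}} |(\hat{P}[g/g^*])(x)|. \qquad (5)$$

Since $-1 \le g/g^* \le 1$, we have $|(\hat{P}[g/g^*])(x)| \le 2\, \|P^n(x, \cdot) - \pi(\cdot)\|$ by part (b), so
that $\sup_{x \in \mathcal{X}} |(\hat{P}[g/g^*])(x)| \le t(n)$. The result then follows from part (a) together
with (5).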

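The first equality of part (f) follows since, as in the proof of part (b) with
$a = -1$ and $b = 1$, we have

$$\|\mu(\cdot) - \nu(\cdot)\| = \frac{1}{2} \left( \int_{g > h} (g - h)\,d\rho + \int_{h > g} (h - g)\,d\rho \right) = \frac{1}{2} \int_{\mathcal{X}} (M - m)\,d\rho.$$

The second equality of part (f) then follows since $M + m = g + h$, so that
$\int_{\mathcal{X}} (M + m)\,d\rho = 2$, and hence

$$\frac{1}{2} \int_{\mathcal{X}} (M - m)\,d\rho = 1 - \frac{1}{2} \left( 2 - \int_{\mathcal{X}} (M - m)\,d\rho \right) = 1 - \frac{1}{2} \int_{\mathcal{X}} \big( (M + m) - (M - m) \big)\,d\rho = 1 - \int_{\mathcal{X}} m\,d\rho.$$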



For part (g), we let $a = \int_{\mathcal{X}} m\,d\rho$, $b = \int_{\mathcal{X}} (g - m)\,d\rho$, and $c = \int_{\mathcal{X}} (h - m)\,d\rho$.
The statement is trivial if any of $a$, $b$, $c$ equal zero, so assume they are all positive.
We then jointly construct random variables $Z$, $U$, $V$, $I$ such that $Z$ has density
$m/a$, $U$ has density $(g - m)/b$, $V$ has density $(h - m)/c$, and $I$ is independent
of $Z$, $U$, $V$ with $\mathbf{P}[I = 1] = a$ and $\mathbf{P}[I = 0] = 1 - a$. We then let $X = Y = Z$ if
$I = 1$, and $X = U$ and $Y = V$ if $I = 0$. Then it is easily checked that $X \sim \mu(\cdot)$
and $Y \sim \nu(\cdot)$. Furthermore $U$ and $V$ have disjoint support, so $\mathbf{P}[U = V] = 0$.
Then using part (f),

$$\mathbf{P}[X = Y] = \mathbf{P}[I = 1] = a = 1 - \|\mu(\cdot) - \nu(\cdot)\|,$$

as claimed. □
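
On a finite state space, where densities are taken with respect to counting measure, the definition of total variation distance and the two expressions in Proposition 3(f) can be compared directly. The sketch below is illustrative only; the function names and the two example distributions are hypothetical.

```python
import numpy as np
from itertools import combinations

def tv_sup_over_sets(nu1, nu2):
    """Total variation distance by brute force: sup over all subsets A."""
    n = len(nu1)
    best = 0.0
    for r in range(n + 1):
        for A in combinations(range(n), r):
            idx = np.array(A, dtype=int)
            best = max(best, abs(nu1[idx].sum() - nu2[idx].sum()))
    return best

def tv_half_l1(nu1, nu2):
    """(1/2) * sum of (M - m), where M = max(g, h) and m = min(g, h)."""
    return 0.5 * np.abs(nu1 - nu2).sum()

def tv_one_minus_overlap(nu1, nu2):
    """1 - sum of m = min(g, h)."""
    return 1.0 - np.minimum(nu1, nu2).sum()

nu1 = np.array([0.5, 0.3, 0.1, 0.1])
nu2 = np.array([0.2, 0.2, 0.3, 0.3])
values = (tv_sup_over_sets(nu1, nu2),
          tv_half_l1(nu1, nu2),
          tv_one_minus_overlap(nu1, nu2))
```

All three values agree (here each equals 0.4, attained by the set consisting of the first two states). The brute-force version enumerates $2^n$ subsets and is meant only as a check of the definition.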

Remark. Proposition 3(e) is false without the factor of 2. For example, suppose
$\mathcal{X} = \{1, 2\}$, with $P(1, \{1\}) = 0.3$, $P(1, \{2\}) = 0.7$, $P(2, \{1\}) = 0.4$, $P(2, \{2\}) = 0.6$,
$\pi(1) = \frac{4}{11}$, and $\pi(2) = \frac{7}{11}$. Then $\pi(\cdot)$ is stationary, and $\sup_{x \in \mathcal{X}} \|P(x, \cdot) - \pi(\cdot)\| = 0.0636$,
and $\sup_{x \in \mathcal{X}} \|P^2(x, \cdot) - \pi(\cdot)\| = 0.00636$, but $0.00636 > (0.0636)^2$.
On the other hand, some authors instead define total variation distance as twice
the value used here, in which case the factor of 2 in Proposition 3(e) is not
written explicitly.
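
The numbers in this Remark are easy to reproduce. The following sketch (with a hypothetical helper `sup_tv`) computes both suprema for the two-state chain and confirms that sub-multiplicativity fails without the factor of 2 but holds with it.

```python
import numpy as np

P = np.array([[0.3, 0.7],
              [0.4, 0.6]])
pi = np.array([4/11, 7/11])

def sup_tv(Pn, pi):
    """sup over starting states x of ||P^n(x, .) - pi(.)||."""
    return max(0.5 * np.abs(row - pi).sum() for row in Pn)

d1 = sup_tv(P, pi)          # 7/110, about 0.0636
d2 = sup_tv(P @ P, pi)      # 7/1100, about 0.00636

fails_without_factor = d2 > d1 ** 2            # 0.00636 > 0.00405
holds_with_factor = 2 * d2 <= (2 * d1) ** 2    # t(2) <= t(1)^2
```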



3.2. Asymptotic Convergence 

Even if a Markov chain has stationary distribution $\pi(\cdot)$, it may still fail to
converge to stationarity:

Example 1. Suppose $\mathcal{X} = \{1, 2, 3\}$, with $\pi\{1\} = \pi\{2\} = \pi\{3\} = 1/3$. Let
$P(1, \{1\}) = P(1, \{2\}) = P(2, \{1\}) = P(2, \{2\}) = 1/2$, and $P(3, \{3\}) = 1$.
Then $\pi(\cdot)$ is stationary. However, if $X_0 = 1$, then $X_n \in \{1, 2\}$ for all $n$, so
$\mathbf{P}(X_n = 3) = 0$ for all $n$, so $\mathbf{P}(X_n = 3) \not\to \pi\{3\}$, and the distribution of $X_n$ does
not converge to $\pi(\cdot)$. (In fact, here the stationary distribution is not unique, and
the distribution of $X_n$ converges to a different stationary distribution defined
by $\pi\{1\} = \pi\{2\} = 1/2$.)

The above example is "reducible", in that the chain can never get from state 1
to state 3, in any number of steps. Now, the classical notion of "irreducibility" is
that the chain has positive probability of eventually reaching any state from any
other state, but if $\mathcal{X}$ is uncountable then that condition is impossible. Instead,
we demand the weaker condition of $\phi$-irreducibility:

Definition. A chain is $\phi$-irreducible if there exists a non-zero $\sigma$-finite measure
$\phi$ on $\mathcal{X}$ such that for all $A \subseteq \mathcal{X}$ with $\phi(A) > 0$, and for all $x \in \mathcal{X}$, there exists
a positive integer $n = n(x, A)$ such that $P^n(x, A) > 0$.

For example, if $\phi(A) = \delta_{x_*}(A)$, then this requires that $x_*$ has positive
probability of eventually being reached from any state $x$. Thus, if a chain has any
one state which is reachable from anywhere (which on a finite state space is
equivalent to being indecomposable), then it is $\phi$-irreducible. However, if $\mathcal{X}$ is
uncountable then often $P(x, \{y\}) = 0$ for all $x$ and $y$. In that case, $\phi(\cdot)$ might
instead be e.g. Lebesgue measure on $\mathbf{R}^d$, so that $\phi(\{x\}) = 0$ for all singleton
sets, but such that all subsets $A$ of positive Lebesgue measure are eventually
reachable with positive probability from any $x \in \mathcal{X}$.

Running Example. Here we introduce a running example, to which we shall
return several times. Suppose that $\pi(\cdot)$ is a probability measure having
unnormalised density function $\pi_u$ with respect to $d$-dimensional Lebesgue measure.
Consider the Metropolis-Hastings algorithm for $\pi_u$ with proposal density $q(x, \cdot)$
with respect to $d$-dimensional Lebesgue measure. Then if $q(\cdot, \cdot)$ is positive and
continuous on $\mathbf{R}^d \times \mathbf{R}^d$, and $\pi_u$ is finite everywhere, then the algorithm is
$\pi$-irreducible. Indeed, let $\pi(A) > 0$. Then there exists $R > 0$ such that $\pi(A_R) > 0$,
where $A_R = A \cap B_R(0)$, and $B_R(0)$ represents the ball of radius $R$ centred at 0.
Then by continuity, for any $x \in \mathbf{R}^d$, $\inf_{y \in A_R} \min\{q(x, y), q(y, x)\} \ge \epsilon$ for some
$\epsilon > 0$, and thus we have (assuming $\pi_u(x) > 0$, otherwise $P(x, A) > 0$ follows
immediately) that

$$\begin{aligned}
P(x, A) \ \ge\ P(x, A_R) \ &\ge\ \int_{A_R} q(x, y) \min\left[1, \frac{\pi_u(y)\, q(y, x)}{\pi_u(x)\, q(x, y)}\right] dy \\
&\ge\ \epsilon\, \mathrm{Leb}\big(\{y \in A_R : \pi_u(y) \ge \pi_u(x)\}\big) + \frac{\epsilon K}{\pi_u(x)}\, \pi\big(\{y \in A_R : \pi_u(y) < \pi_u(x)\}\big),
\end{aligned}$$

where $K = \int_{\mathcal{X}} \pi_u(x)\,dx > 0$. Since $\pi(\cdot)$ is absolutely continuous with respect to
Lebesgue measure, and since $\mathrm{Leb}(A_R) > 0$, it follows that the terms in this final
sum cannot both be 0, so that we must have $P(x, A) > 0$. Hence, the chain is
$\pi$-irreducible.

Even $\phi$-irreducible chains might not converge in distribution, due to
periodicity problems, as in the following simple example.

Example 2. Suppose again $\mathcal{X} = \{1, 2, 3\}$, with $\pi\{1\} = \pi\{2\} = \pi\{3\} = 1/3$.
Let $P(1, \{2\}) = P(2, \{3\}) = P(3, \{1\}) = 1$. Then $\pi(\cdot)$ is stationary, and the
chain is $\phi$-irreducible [e.g. with $\phi(\cdot) = \delta_1(\cdot)$]. However, if $X_0 = 1$ (say), then
$X_n = 1$ whenever $n$ is a multiple of 3, so $\mathbf{P}(X_n = 1)$ oscillates between 0 and 1,
so again $\mathbf{P}(X_n = 1) \not\to \pi\{1\}$, and there is again no convergence to $\pi(\cdot)$.
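
The failure of convergence in Example 2 can be seen numerically. In the sketch below (helper names are hypothetical, and state 1 of the example is index 0 in the code), the law of $X_n$ is always a point mass, so its total variation distance to $\pi(\cdot)$ stays at 2/3; by contrast, averaging the law over any three consecutive times gives exactly $\pi(\cdot)$.

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
pi = np.full(3, 1/3)

def tv(nu1, nu2):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * np.abs(nu1 - nu2).sum()

def n_step_law(P, x, n):
    """Row x of P^n, i.e. the law of X_n given X_0 = x."""
    return np.linalg.matrix_power(P, n)[x]

# Distance to stationarity at times n = 1, ..., 9, started from state 1.
dists = [tv(n_step_law(P, 0, n), pi) for n in range(1, 10)]

# Average of the laws at times n, n+1, n+2 (one full period of length 3).
window = [np.mean([n_step_law(P, 0, i) for i in range(n, n + 3)], axis=0)
          for n in range(1, 10)]
```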

To avoid this problem, we require aperiodicity, and we adopt the following
definition (which suffices for the $\phi$-irreducible chains with stationary
distributions that we shall study; for more general relationships see e.g. Meyn and
Tweedie [54], Theorem 5.4.4):

Definition. A Markov chain with stationary distribution $\pi(\cdot)$ is aperiodic if
there do not exist $d \ge 2$ and disjoint subsets $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_d \subseteq \mathcal{X}$ with
$P(x, \mathcal{X}_{i+1}) = 1$ for all $x \in \mathcal{X}_i$ ($1 \le i \le d - 1$), and $P(x, \mathcal{X}_1) = 1$ for all
$x \in \mathcal{X}_d$, such that $\pi(\mathcal{X}_1) > 0$ (and hence $\pi(\mathcal{X}_i) > 0$ for all $i$). (Otherwise, the
chain is periodic, with period $d$, and periodic decomposition $\mathcal{X}_1, \ldots, \mathcal{X}_d$.)

Running Example, Continued. Here we return to the Running Example
introduced above, and demonstrate that no additional assumptions are necessary
to ensure aperiodicity. To see this, suppose that $\mathcal{X}_1$ and $\mathcal{X}_2$ are disjoint subsets
of $\mathcal{X}$ both of positive $\pi$ measure, with $P(x, \mathcal{X}_2) = 1$ for all $x \in \mathcal{X}_1$. But just take
any $x \in \mathcal{X}_1$, then since $\mathcal{X}_1$ must have positive Lebesgue measure,

$$P(x, \mathcal{X}_1) \ \ge\ \int_{y \in \mathcal{X}_1} q(x, y)\, \alpha(x, y)\,dy \ >\ 0$$

for a contradiction. Therefore aperiodicity must hold. (It is possible to
demonstrate similar results for other MCMC algorithms, such as the Gibbs sampler, see
e.g. Tierney [93]. Indeed, it is rather rare for MCMC algorithms to be periodic.)

Now we can state the main asymptotic convergence theorem, whose proof is
described in Section 4. (This theorem assumes that the state space's $\sigma$-algebra
is countably generated, but this is a very weak assumption which is true for e.g.
any countable state space, or any subset of $\mathbf{R}^d$ with the usual Borel $\sigma$-algebra,
since that $\sigma$-algebra is generated by the balls with rational centers and rational
radii.)

Theorem 4. If a Markov chain on a state space with countably generated
$\sigma$-algebra is $\phi$-irreducible and aperiodic, and has a stationary distribution $\pi(\cdot)$,
then for $\pi$-a.e. $x \in \mathcal{X}$,

$$\lim_{n \to \infty} \|P^n(x, \cdot) - \pi(\cdot)\| = 0.$$

In particular, $\lim_{n \to \infty} P^n(x, A) = \pi(A)$ for all measurable $A \subseteq \mathcal{X}$.

Fact 5. In fact, under the conditions of Theorem 4, if $h : \mathcal{X} \to \mathbf{R}$ with
$\pi(|h|) < \infty$, then a "strong law of large numbers" also holds (see e.g. Meyn
and Tweedie [54], Theorem 17.0.1), as follows:

$$\lim_{n \to \infty} (1/n) \sum_{i=1}^{n} h(X_i) = \pi(h) \quad \text{w.p. 1.} \qquad (6)$$

Theorem 4 requires that the chain be $\phi$-irreducible and aperiodic, and have
stationary distribution $\pi(\cdot)$. Now, MCMC algorithms are created precisely so
that $\pi(\cdot)$ is stationary, so this requirement is not a problem. Furthermore, it
is usually straightforward to verify that the chain is $\phi$-irreducible, where e.g. $\phi$ is
Lebesgue measure on an appropriate region. Also, aperiodicity almost always
holds, e.g. for virtually any Metropolis algorithm or Gibbs sampler. Hence,
Theorem 4 is widely applicable to MCMC algorithms.

It is worth asking why the convergence in Theorem 4 is just from $\pi$-a.e.
$x \in \mathcal{X}$. The problem is that the chain may have unpredictable behaviour on a
"null set" of $\pi$-measure 0, and fail to converge there. Here is a simple example
due to C. Geyer (personal communication):

Example 3. Let $\mathcal{X} = \{1, 2, \ldots\}$. Let $P(1, \{1\}) = 1$, and for $x \ge 2$, $P(x, \{1\}) = 1/x^2$
and $P(x, \{x + 1\}) = 1 - (1/x^2)$. Then the chain has stationary distribution
$\pi(\cdot) = \delta_1(\cdot)$, and it is $\pi$-irreducible and aperiodic. On the other hand, if
$X_0 = x \ge 2$, then $\mathbf{P}[X_n = x + n \text{ for all } n] = \prod_{i=x}^{\infty} (1 - (1/i^2)) > 0$, so that
$\|P^n(x, \cdot) - \pi(\cdot)\| \not\to 0$. Here Theorem 4 holds for $x = 1$ which is indeed $\pi$-a.e. $x \in \mathcal{X}$, but it
does not hold for $x \ge 2$.
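
The infinite product in Example 3 can be evaluated in closed form: since $1 - 1/i^2 = (i-1)(i+1)/i^2$, the partial products telescope to $\frac{x-1}{x} \cdot \frac{N+1}{N}$, so the limit is $(x-1)/x$. The sketch below (with a hypothetical function name) checks this numerically by accumulating a long partial product in log space.

```python
import numpy as np

def escape_probability(x, n_terms=1_000_000):
    """Partial product of (1 - 1/i^2) for i = x, ..., x + n_terms - 1."""
    i = np.arange(x, x + n_terms, dtype=float)
    # Sum the logs for numerical stability, then exponentiate.
    return float(np.exp(np.sum(np.log1p(-1.0 / i**2))))

p2 = escape_probability(2)   # close to (2 - 1)/2 = 0.5
p5 = escape_probability(5)   # close to (5 - 1)/5 = 0.8
```

So, started from $x = 2$, the chain drifts off to infinity without ever reaching state 1 with probability 1/2, and this probability increases towards 1 as the starting point $x$ grows.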

Remark. The transient behaviour of the chain on the null set in Example 3
is not accidental. If instead the chain converged on the null set to some other
stationary distribution, but still had positive probability of escaping the null
set (as it must to be $\phi$-irreducible), then with probability 1 the chain would
eventually exit the null set, and would thus converge to $\pi(\cdot)$ from the null set
after all.

It is reasonable to ask under what circumstances the conclusions of
Theorem 4 will hold for all $x \in \mathcal{X}$, not just $\pi$-a.e. Obviously, this will hold if
the transition kernels $P(x, \cdot)$ are all absolutely continuous with respect to $\pi(\cdot)$
(i.e., $P(x, dy) = p(x, y)\, \pi(dy)$ for some function $p : \mathcal{X} \times \mathcal{X} \to [0, \infty)$), or for any
Metropolis algorithm whose proposal distributions $Q(x, \cdot)$ are absolutely
continuous with respect to $\pi(\cdot)$. It is also easy to see that this will hold for our Running
Example described above. More generally, it suffices that the chain be Harris
recurrent, meaning that for all $B \subseteq \mathcal{X}$ with $\pi(B) > 0$, and all $x \in \mathcal{X}$, the chain will
eventually reach $B$ from $x$ with probability 1, i.e. $\mathbf{P}[\exists\, n : X_n \in B \mid X_0 = x] = 1$.
This condition is stronger than $\pi$-irreducibility (as evidenced by Example 3); for
further discussions of this see e.g. Orey [61], Tierney [93], Chan and Geyer [15],
and [75].

Finally, we note that periodic chains occasionally arise in MCMC (see e.g. 
Neal [58]), and much of the theory can be applied to this case. For example, we 
have the following. 

Corollary 6. If a Markov chain is $\phi$-irreducible, with period $d \ge 2$, and has a stationary distribution $\pi(\cdot)$, then for $\pi$-a.e. $x \in \mathcal{X}$,

$$\lim_{n \to \infty} \Big\| (1/d) \sum_{i=n}^{n+d-1} P^i(x,\cdot) - \pi(\cdot) \Big\| = 0, \tag{7}$$

and also the strong law of large numbers (6) continues to hold without change.

Proof. Let the chain have periodic decomposition $\mathcal{X}_1, \ldots, \mathcal{X}_d \subseteq \mathcal{X}$, and let $P'$ be the $d$-step chain $P^d$ restricted to the state space $\mathcal{X}_1$. Then $P'$ is $\phi$-irreducible and aperiodic on $\mathcal{X}_1$, with stationary distribution $\pi'(\cdot)$ which satisfies that $\pi(\cdot) = (1/d) \sum_{j=0}^{d-1} (\pi' P^j)(\cdot)$. Now, from Proposition 3(c), it suffices to prove the Corollary when $n = md$ with $m \to \infty$, and for simplicity we assume without loss of generality that $x \in \mathcal{X}_1$. From Proposition 3(d), we have $\|P^{md+j}(x,\cdot) - (\pi' P^j)(\cdot)\| \le \|P^{md}(x,\cdot) - \pi'(\cdot)\|$ for $j \in \mathbf{N}$. Then, by the triangle inequality,

$$\Big\| (1/d) \sum_{i=md}^{md+d-1} P^i(x,\cdot) - \pi(\cdot) \Big\| = \Big\| (1/d) \sum_{j=0}^{d-1} P^{md+j}(x,\cdot) - (1/d) \sum_{j=0}^{d-1} (\pi' P^j)(\cdot) \Big\|$$

$$\le (1/d) \sum_{j=0}^{d-1} \|P^{md+j}(x,\cdot) - (\pi' P^j)(\cdot)\| \le (1/d) \sum_{j=0}^{d-1} \|P^{md}(x,\cdot) - \pi'(\cdot)\|.$$

But applying Theorem 4 to $P'$, we obtain that $\lim_{m \to \infty} \|P^{md}(x,\cdot) - \pi'(\cdot)\| = 0$ for $\pi'$-a.e. $x \in \mathcal{X}_1$, thus giving the first result.

To establish (6), let $\bar{P}$ be the transition kernel for the Markov chain on $\mathcal{X}_1 \times \ldots \times \mathcal{X}_d$ corresponding to the sequence $\{(X_{md}, X_{md+1}, \ldots, X_{md+d-1})\}_{m=0}^{\infty}$, and let $\bar{h}(x_0, \ldots, x_{d-1}) = (1/d)(h(x_0) + \ldots + h(x_{d-1}))$. Then just like $P'$, we see that $\bar{P}$ is $\phi$-irreducible and aperiodic, with stationary distribution given by

$$\bar{\pi} = \pi' \times (\pi' P) \times (\pi' P^2) \times \ldots \times (\pi' P^{d-1}).$$

Applying Fact 5 to $\bar{P}$ and $\bar{h}$ establishes that (6) holds without change. □



Remark. By similar methods, it follows that (6) also remains true in the periodic case, i.e. that

$$\lim_{n \to \infty} (1/n) \sum_{i=1}^{n} h(X_i) = \pi(h) \quad \text{w.p. 1}$$

whenever $h : \mathcal{X} \to \mathbf{R}$ with $\pi(|h|) < \infty$, provided the Markov chain is $\phi$-irreducible and countably generated, without any assumption of aperiodicity. In particular, both (7) and (6) hold (without further assumptions re periodicity) for any irreducible (or indecomposable) Markov chain on a finite state space.

A related question for periodic chains, not considered here, is to consider quantitative bounds on the difference of average distributions,

$$\Big\| (1/n) \sum_{i=1}^{n} P^i(x,\cdot) - \pi(\cdot) \Big\|,$$

through the use of shift-coupling; see Aldous and Thorisson [3], and [5H].

3.3. Uniform Ergodicity

Theorem 4 implies asymptotic convergence to stationarity, but does not say anything about the rate of this convergence. One "qualitative" convergence rate property is uniform ergodicity:

Definition. A Markov chain having stationary distribution $\pi(\cdot)$ is uniformly ergodic if

$$\|P^n(x,\cdot) - \pi(\cdot)\| \le M \rho^n, \qquad n = 1, 2, 3, \ldots$$

for some $\rho < 1$ and $M < \infty$.

One equivalence of uniform ergodicity is: 

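Proposition 7. A Markov chain with stationary distribution $\pi(\cdot)$ is uniformly ergodic if and only if $\sup_{x \in \mathcal{X}} \|P^n(x,\cdot) - \pi(\cdot)\| < 1/2$ for some $n \in \mathbf{N}$.

Proof. If the chain is uniformly ergodic, then

$$\lim_{n \to \infty} \sup_{x \in \mathcal{X}} \|P^n(x,\cdot) - \pi(\cdot)\| \le \lim_{n \to \infty} M \rho^n = 0,$$

so $\sup_{x \in \mathcal{X}} \|P^n(x,\cdot) - \pi(\cdot)\| < 1/2$ for all sufficiently large $n$. Conversely, if $\sup_{x \in \mathcal{X}} \|P^n(x,\cdot) - \pi(\cdot)\| < 1/2$ for some $n \in \mathbf{N}$, then in the notation of Proposition 3(e), we have that $d(n) \equiv \beta < 1$, so that for all $j \in \mathbf{N}$, $d(jn) \le (d(n))^j = \beta^j$. Hence, from Proposition 3(c),

$$\|P^m(x,\cdot) - \pi(\cdot)\| \le \|P^{\lfloor m/n \rfloor n}(x,\cdot) - \pi(\cdot)\| \le \tfrac{1}{2}\, d(\lfloor m/n \rfloor n) \le \beta^{\lfloor m/n \rfloor} \le \beta^{-1} (\beta^{1/n})^m,$$

so the chain is uniformly ergodic with $M = \beta^{-1}$ and $\rho = \beta^{1/n}$. □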
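
The criterion of Proposition 7 is easy to check numerically on a finite state space. The following is a minimal sketch under our own assumptions: the chain is given as a list-of-rows transition matrix, the stationary distribution is supplied by the caller, and the helper names (`mat_mul`, `sup_tv`) and the two example matrices are purely illustrative. Total variation distance is computed as half the $L^1$ distance, which on a finite space agrees with $\sup_A |\nu_1(A) - \nu_2(A)|$.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sup_tv(P, pi, n):
    """sup over x of ||P^n(x, .) - pi(.)|| for a finite chain (n >= 1)."""
    Pn = P
    for _ in range(n - 1):
        Pn = mat_mul(Pn, P)
    return max(0.5 * sum(abs(p - q) for p, q in zip(row, pi)) for row in Pn)


# Illustrative two-state chain with stationary distribution (2/3, 1/3).
MIXING = [[0.9, 0.1], [0.2, 0.8]]
MIXING_PI = [2.0 / 3.0, 1.0 / 3.0]

# Two-state chain that never moves; the uniform distribution is stationary.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
UNIFORM_PI = [0.5, 0.5]

if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, sup_tv(MIXING, MIXING_PI, n), sup_tv(IDENTITY, UNIFORM_PI, n))
```

For the first chain the supremum is already below 1/2 at $n = 1$ and then decays geometrically, while for the chain that never moves it stays at exactly 1/2 for every $n$, so the strict inequality in Proposition 7 cannot be relaxed.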

Remark. The above Proposition of course continues to hold if we replace 1/2 by $\delta$ for any $0 < \delta < 1/2$. However, it is false for $\delta > 1/2$. For example, if $\mathcal{X} = \{1, 2\}$, with $P(1,\{1\}) = P(2,\{2\}) = 1$, and $\pi(\cdot)$ is uniform on $\mathcal{X}$, then $\|P^n(x,\cdot) - \pi(\cdot)\| = 1/2$ for all $x \in \mathcal{X}$ and $n \in \mathbf{N}$.

To develop further conditions which ensure uniform ergodicity, we require a 
definition. 

Definition. A subset $C \subseteq \mathcal{X}$ is small (or, $(n_0, \epsilon, \nu)$-small) if there exists a positive integer $n_0$, $\epsilon > 0$, and a probability measure $\nu(\cdot)$ on $\mathcal{X}$ such that the following minorisation condition holds:

$$P^{n_0}(x,\cdot) \ge \epsilon\, \nu(\cdot), \qquad x \in C, \tag{8}$$

i.e. $P^{n_0}(x, A) \ge \epsilon\, \nu(A)$ for all $x \in C$ and all measurable $A \subseteq \mathcal{X}$.

Remark. Some authors (e.g. Meyn and Tweedie [54]) also require that $C$ have positive stationary measure, but for simplicity we don't explicitly require that here. In any case, $\pi(C) > 0$ follows under the additional assumption of the drift condition (10) considered in the next section.

Intuitively, this condition means that all of the $n_0$-step transitions from within $C$ have an "$\epsilon$-overlap", i.e. a component of size $\epsilon$ in common. (This concept goes back to Doeblin [25]; for further background, see e.g. [23], [5], [BO], [4], and [53]; for applications to convergence rates see e.g. [55], [80], [82], [71], [77], [24], [85].) We note that if $\mathcal{X}$ is countable, and if

$$\epsilon_{n_0} \equiv \sum_{y \in \mathcal{X}} \inf_{x \in C} P^{n_0}(x,\{y\}) > 0, \tag{9}$$

then $C$ is $(n_0, \epsilon_{n_0}, \nu)$-small where $\nu\{y\} = \epsilon_{n_0}^{-1} \inf_{x \in C} P^{n_0}(x,\{y\})$. (Furthermore, for an irreducible (or just indecomposable) and aperiodic chain on a finite state space, we always have $\epsilon_{n_0} > 0$ for sufficiently large $n_0$ (see e.g. [81]), so this method always applies in principle.) Similarly, if the transition probabilities have densities with respect to some measure $\eta(\cdot)$, i.e. if $P^{n_0}(x,dy) = p_{n_0}(x,y)\,\eta(dy)$, then we can take $\epsilon_{n_0} = \int_{y \in \mathcal{X}} \big( \inf_{x \in C} p_{n_0}(x,y) \big)\, \eta(dy)$.
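
On a finite state space, the quantities in (9) can be read directly off the transition matrix. The sketch below takes $n_0 = 1$ and assumes the kernel is stored as a list of rows; the function name `minorisation` and the example matrix `P_EXAMPLE` are hypothetical and serve only as an illustration.

```python
def minorisation(P, C):
    """Return (eps, nu) as in (9), for n0 = 1.

    P : transition matrix as a list of rows, P[x][y] = P(x, {y}).
    C : iterable of states forming the candidate small set.
    nu is None when eps == 0 (no minorisation from this construction).
    """
    n = len(P)
    C = list(C)
    # Column-wise infimum over starting points in C.
    col_min = [min(P[x][y] for x in C) for y in range(n)]
    eps = sum(col_min)
    if eps == 0.0:
        return 0.0, None
    nu = [m / eps for m in col_min]
    return eps, nu


# Illustrative three-state chain; here the whole state space is small.
P_EXAMPLE = [[0.5, 0.3, 0.2],
             [0.2, 0.6, 0.2],
             [0.1, 0.4, 0.5]]

if __name__ == "__main__":
    eps, nu = minorisation(P_EXAMPLE, range(3))
    print(eps, nu)
```

By construction every row of the matrix dominates `eps * nu` entrywise, which is exactly the minorisation condition (8) with $n_0 = 1$.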

Remark. As observed in [72], small-set conditions of the form $P(x,\cdot) \ge \epsilon\, \nu(\cdot)$ for all $x \in C$ can be replaced by pseudo-small conditions of the form $P(x,\cdot) \ge \epsilon\, \nu_{xy}(\cdot)$ and $P(y,\cdot) \ge \epsilon\, \nu_{xy}(\cdot)$ for all $x, y \in C$, without affecting any bounds which use pairwise coupling (which includes all of the bounds considered here before Section 5). Thus, all of the results stated in this section remain true without change if "small set" is replaced by "pseudo-small set" in the hypotheses. For ease of exposition, we do not emphasise this point herein.

The main result guaranteeing uniform ergodicity, which goes back to Doeblin [22] and Doob [23] and in some sense even to Markov [50], is the following.

Theorem 8. Consider a Markov chain with invariant probability distribution $\pi(\cdot)$. Suppose the minorisation condition (8) is satisfied for some $n_0 \in \mathbf{N}$ and $\epsilon > 0$ and probability measure $\nu(\cdot)$, in the special case $C = \mathcal{X}$ (i.e., the entire state space is small). Then the chain is uniformly ergodic, and in fact $\|P^n(x,\cdot) - \pi(\cdot)\| \le (1-\epsilon)^{\lfloor n/n_0 \rfloor}$ for all $x \in \mathcal{X}$, where $\lfloor r \rfloor$ is the greatest integer not exceeding $r$.
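
Because the bound in Theorem 8 is fully explicit, it can be turned into an iteration count. The following sketch computes the smallest $n$ for which $(1-\epsilon)^{\lfloor n/n_0 \rfloor}$ is at most a given tolerance. The function names and the default tolerance 0.01 are our own choices, and the values of `eps` and `n0` are assumed to come from a minorisation condition established separately.

```python
def theorem8_bound(n, eps, n0=1):
    """The bound (1 - eps)^floor(n / n0) on the total variation distance."""
    return (1.0 - eps) ** (n // n0)


def iterations_needed(eps, n0=1, tol=0.01):
    """Smallest n with (1 - eps)^floor(n / n0) <= tol, assuming 0 < eps <= 1.

    A plain loop is used for clarity; for very small eps a closed form
    based on logarithms would be faster.
    """
    m = 0
    while (1.0 - eps) ** m > tol:
        m += 1
    return m * n0


if __name__ == "__main__":
    print(iterations_needed(0.6))        # eps = 0.6, n0 = 1
    print(iterations_needed(0.6, n0=3))  # same eps, but only every 3 steps
```

Note that the count scales linearly in $n_0$, so a minorisation that only holds for a large $n_0$ gives a correspondingly weaker guarantee.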

Theorem 8 is proved in Section 4. We note also that Theorem 8 provides a quantitative bound on the distance to stationarity $\|P^n(x,\cdot) - \pi(\cdot)\|$, namely that it must be $\le (1-\epsilon)^{\lfloor n/n_0 \rfloor}$. Thus, once $n_0$ and $\epsilon$ are known, we can find $n_*$ such that, say, $\|P^{n_*}(x,\cdot) - \pi(\cdot)\| \le 0.01$, a fact which can be applied in certain MCMC contexts (see e.g. [78]). We can then say that $n_*$ iterations "suffice for convergence" of the Markov chain. On a discrete state space, we have that $\|P^n(x,\cdot) - \pi(\cdot)\| \le (1 - \epsilon_{n_0})^{\lfloor n/n_0 \rfloor}$ with $\epsilon_{n_0}$ as in (9).

Running Example, Continued. Recall our Running Example, introduced above. Since we have imposed strong continuity conditions on $q$, it is natural to conjecture that compact sets are small. However, this is not true without extra regularity conditions. For instance, consider dimension $d = 1$, and suppose that $\pi_u(x) = \mathbf{1}_{0 < |x| < 1}\, |x|^{-1/2}$, and let $q(x,y) \propto \exp\{-(x-y)^2/2\}$; then it is easy to check that any neighbourhood of 0 is not small. However, in the general setup of our Running Example, all compact sets on which $\pi_u$ is bounded are small. To see this, suppose $C$ is a compact set on which $\pi_u$ is bounded by $k < \infty$. Let $x \in C$, and let $D$ be any compact set of positive Lebesgue and $\pi$ measure, such that $\inf_{x \in C,\, y \in D} q(x,y) = \epsilon > 0$. We then have, for all $y \in D$,

$$P(x,dy) \ge q(x,y)\,dy\, \min\Big\{1, \frac{\pi_u(y)}{\pi_u(x)}\Big\} \ge \epsilon\,dy\, \min\Big\{1, \frac{\pi_u(y)}{k}\Big\},$$

which is a positive measure independent of $x$. Hence, $C$ is small. (This example also shows that if $\pi_u$ is continuous, the state space $\mathcal{X}$ is compact, and $q$ is continuous and positive, then $\mathcal{X}$ is small, and so the chain must be uniformly ergodic.)

If a Markov chain is not uniformly ergodic (as few MCMC algorithms on unbounded state spaces are), then Theorem 8 cannot be applied. However, it is still of great importance, given a Markov chain kernel $P$ and an initial state $x$, to be able to find $n_*$ so that, say, $\|P^{n_*}(x,\cdot) - \pi(\cdot)\| \le 0.01$. This issue is discussed further below.

3.4. Geometric ergodicity

A weaker condition than uniform ergodicity is geometric ergodicity, as follows (for background and history, see e.g. Nummelin [5D], and Meyn and Tweedie [54]):

Definition. A Markov chain with stationary distribution $\pi(\cdot)$ is geometrically ergodic if

$$\|P^n(x,\cdot) - \pi(\cdot)\| \le M(x)\, \rho^n, \qquad n = 1, 2, 3, \ldots$$

for some $\rho < 1$, where $M(x) < \infty$ for $\pi$-a.e. $x \in \mathcal{X}$.

The difference between geometric ergodicity and uniform ergodicity is that now the constant $M$ may depend on the initial state $x$.

Of course, if the state space $\mathcal{X}$ is finite, then all irreducible and aperiodic Markov chains are geometrically (in fact, uniformly) ergodic. However, for infinite $\mathcal{X}$ this is not the case. For example, it is shown by Mengersen and Tweedie [52] (see also [75]) that a symmetric random-walk Metropolis algorithm is geometrically ergodic essentially if and only if $\pi(\cdot)$ has finite exponential moments. (For chains which are not geometrically ergodic, it is possible also to study polynomial ergodicity, not considered here; see Fort and Moulines [29], and Jarner and Roberts [42].) Hence, we now discuss conditions which ensure geometric ergodicity.

Definition. Given Markov chain transition probabilities $P$ on a state space $\mathcal{X}$, and a measurable function $f : \mathcal{X} \to \mathbf{R}$, define the function $Pf : \mathcal{X} \to \mathbf{R}$ such that $(Pf)(x)$ is the conditional expected value of $f(X_{n+1})$, given that $X_n = x$. In symbols, $(Pf)(x) = \int_{y \in \mathcal{X}} f(y)\, P(x,dy)$.

Definition. A Markov chain satisfies a drift condition (or, univariate geometric drift condition) if there are constants $0 < \lambda < 1$ and $b < \infty$, and a function $V : \mathcal{X} \to [1,\infty]$, such that

$$PV \le \lambda V + b\, \mathbf{1}_C, \tag{10}$$

i.e. such that $\int_{\mathcal{X}} P(x,dy)\, V(y) \le \lambda V(x) + b\, \mathbf{1}_C(x)$ for all $x \in \mathcal{X}$.

The main result guaranteeing geometric ergodicity is the following. 

Theorem 9. Consider a $\phi$-irreducible, aperiodic Markov chain with stationary distribution $\pi(\cdot)$. Suppose the minorisation condition (8) is satisfied for some $C \subseteq \mathcal{X}$ and $\epsilon > 0$ and probability measure $\nu(\cdot)$. Suppose further that the drift condition (10) is satisfied for some constants $0 < \lambda < 1$ and $b < \infty$, and a function $V : \mathcal{X} \to [1,\infty]$ with $V(x) < \infty$ for at least one (and hence for $\pi$-a.e.) $x \in \mathcal{X}$. Then the chain is geometrically ergodic.

Theorem 9 is usually proved by complicated analytic arguments (see e.g. [60], [54], [7]). In Section 4, we describe a proof of Theorem 9 which uses direct coupling constructions instead. Note also that Theorem 9 provides no quantitative bounds on $M(x)$ or $\rho$, though this is remedied in Theorem 12 below.

Fact 10. In fact, it follows from Theorems 15.0.1, 16.0.1, and 14.3.7 of Meyn and Tweedie [54], and Proposition 1 of [69], that the minorisation condition (8) and drift condition (10) of Theorem 9 are equivalent (assuming $\phi$-irreducibility and aperiodicity) to the apparently stronger property of "$V$-uniform ergodicity", i.e. that there is $C < \infty$ and $\rho < 1$ such that

$$\sup_{|f| \le V} |P^n f(x) - \pi(f)| \le C\, V(x)\, \rho^n, \qquad x \in \mathcal{X},$$

where $\pi(f) = \int_{x \in \mathcal{X}} f(x)\, \pi(dx)$. That is, we can take $\sup_{|f| \le V}$ instead of just $\sup_{0 \le f \le 1}$ (compare Proposition 3 parts (a) and (b)), and we can let $M(x) = C\, V(x)$ in the geometric ergodicity bound. Furthermore, we always have $\pi(V) < \infty$. (The term "$V$-uniform ergodicity", as used in [54], perhaps also implies that $V(x) < \infty$ for all $x \in \mathcal{X}$, rather than just for $\pi$-a.e. $x \in \mathcal{X}$, though we do not consider that distinction further here.)

Open Problem #1. Can direct coupling methods, similar to those used below to prove Theorem 9, also be used to provide an alternative proof of Fact 10?

Example 4. Here we consider a simple example of geometric ergodicity of Metropolis algorithms on $\mathbf{R}$ (see Mengersen and Tweedie [52], and [76]). Suppose that $\mathcal{X} = \mathbf{R}^+$ and $\pi_u(x) = e^{-x}$. We will use a symmetric (about $x$) proposal distribution $q(x,y) = q(|y - x|)$ with support contained in $[x-a, x+a]$. In this simple situation, a natural drift function to take is $V(x) = e^{cx}$ for some $c > 0$. For $x \ge a$, we compute:

$$PV(x) = \int_{x-a}^{x} V(y)\, q(x,y)\,dy + \int_{x}^{x+a} V(y)\, q(x,y)\, \frac{\pi_u(y)}{\pi_u(x)}\,dy + V(x) \int_{x}^{x+a} q(x,y)\, \Big(1 - \frac{\pi_u(y)}{\pi_u(x)}\Big)\,dy.$$

By the symmetry of $q$, this can be written as

$$\int_{x}^{x+a} I(x,y)\, q(x,y)\,dy,$$

where

$$I(x,y) = \frac{V(y)\, \pi_u(y)}{\pi_u(x)} + V(2x - y) + V(x)\Big(1 - \frac{\pi_u(y)}{\pi_u(x)}\Big) = e^{cx}\big[e^{(c-1)u} + e^{-cu} + 1 - e^{-u}\big] = e^{cx}\big[2 - (1 - e^{(c-1)u})(1 - e^{-cu})\big],$$

and where $u = y - x$. For $c < 1$ and $u > 0$, both factors in the product lie strictly between 0 and 1, so $I(x,y) < 2V(x)$, and integrating against $q$ gives $\int_x^{x+a} I(x,y)\, q(x,y)\,dy = 2(1-\epsilon)V(x) \int_x^{x+a} q(x,y)\,dy$ for some positive constant $\epsilon$ (which does not depend on $x$). Thus in this case we have shown that for all $x \ge a$,

$$PV(x) \le \int_{x}^{x+a} 2V(x)(1-\epsilon)\, q(x,y)\,dy = (1-\epsilon)V(x).$$

Furthermore, it is easy to show that $PV(x)$ is bounded on $[0,a]$ and that $[0,a]$ is in fact a small set. Thus, we have demonstrated that the drift condition (10) holds. Hence, the algorithm is geometrically ergodic by Theorem 9. (It turns out that for such Metropolis algorithms, a certain condition, which essentially requires an exponential bound on the tail probabilities of $\pi(\cdot)$, is in fact necessary for geometric ergodicity; see [76].)
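
The drift computation in Example 4 can be checked numerically. The sketch below makes one assumption that is not in the text: the proposal is taken to be Uniform$[x-a, x+a]$, which is one symmetric proposal with the required support. It evaluates the ratio $PV(x)/V(x)$ for $x \ge a$ directly from the Metropolis kernel by a midpoint rule; the function name `drift_ratio` and the grid size are our own choices.

```python
import math


def drift_ratio(c, a, n_grid=20_000):
    """PV(x) / V(x) for x >= a, with V(x) = exp(c x) and pi_u(x) = exp(-x).

    Assumes a Uniform[x - a, x + a] proposal (an illustrative choice).
    The ratio does not depend on x, so x does not appear as an argument.
    n_grid should be even so that the kink at u = 0 falls on a cell edge.
    """
    du = 2.0 * a / n_grid
    total = 0.0
    for i in range(n_grid):
        u = -a + (i + 0.5) * du               # proposed increment y - x
        accept = min(1.0, math.exp(-u))       # Metropolis acceptance probability
        # Accepted move contributes V(y)/V(x) = exp(c u); rejection contributes 1.
        integrand = math.exp(c * u) * accept + (1.0 - accept)
        total += integrand / (2.0 * a) * du   # uniform proposal density 1/(2a)
    return total


if __name__ == "__main__":
    print(drift_ratio(0.5, 1.0))   # c < 1: value below 1, so V contracts
    print(drift_ratio(1.5, 1.0))   # c > 1: value above 1, so this V fails
```

With these illustrative values, $c = 0.5$ gives a ratio below 1, which plays the role of $1 - \epsilon$ in the example, while $c = 1.5$ gives a ratio above 1, consistent with the restriction $c < 1$.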

Implications of geometric ergodicity for central limit theorems are discussed in Section 5. In general, it is believed by practitioners of MCMC that geometric ergodicity is a useful property. But does geometric ergodicity really matter? Consider the following examples.

Example 5. ([71]) Consider an independence sampler, with $\pi(\cdot)$ an Exponential(1) distribution, and $Q(x,\cdot)$ an Exponential($\lambda$) distribution. Then if $0 < \lambda \le 1$, the sampler is geometrically ergodic, has central limit theorems (see Section 5), and generally behaves fairly well even for very small $\lambda$. On the other hand, for $\lambda > 1$ the sampler fails to be geometrically ergodic, and indeed for $\lambda > 2$ it fails to have central limit theorems, and generally behaves quite poorly. For example, the simulations in [71] indicate that with $\lambda = 5$, when started in stationarity and averaged over the first million iterations, the sampler will usually return an average value of about 0.8 instead of 1, and then occasionally return a very large value instead, leading to very unstable behaviour. Thus, this is an example where the property of geometric ergodicity does indeed correspond to stable, useful convergence behaviour.
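
For readers who wish to experiment with Example 5, here is a minimal sketch of the independence sampler. It is our own illustration and not the code used for the simulations in [71]; the function name, the seed, and the run length are arbitrary. For target Exponential(1) and proposal Exponential($\lambda$), the Metropolis-Hastings acceptance ratio $\pi(y) q(x) / (\pi(x) q(y))$ simplifies to $\exp\{(\lambda - 1)(y - x)\}$.

```python
import math
import random


def independence_sampler_mean(lam, n_iter, seed=0):
    """Ergodic average of an independence sampler for Exponential(1).

    Proposals are Exponential(lam); the chain is started in stationarity.
    Returns the average of the visited states (the target mean is 1).
    """
    rng = random.Random(seed)
    x = rng.expovariate(1.0)                  # X_0 ~ pi
    total = 0.0
    for _ in range(n_iter):
        y = rng.expovariate(lam)              # proposal, independent of x
        log_ratio = (lam - 1.0) * (y - x)     # log of pi(y)q(x) / (pi(x)q(y))
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = y                             # accept
        total += x
    return total / n_iter


if __name__ == "__main__":
    # lam <= 1: the geometrically ergodic regime of Example 5.
    print(independence_sampler_mean(0.5, 200_000))
    # lam = 5: the unstable regime; individual runs can differ markedly.
    print(independence_sampler_mean(5.0, 200_000))
```

Running the second call with several different seeds is a simple way to observe the instability described in the example, although any single run is only suggestive.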

However, geometric ergodicity does not always guarantee a useful Markov 
chain algorithm, as the following two examples show. 

Example 6. ("Witch's Hat", e.g. Matthews [51]) Let $\mathcal{X} = [0,1]$, let $\delta = 10^{-100}$ (say), let $0 < a < 1 - \delta$, and let $\pi_u(x) = \delta + \mathbf{1}_{[a, a+\delta]}(x)$. Then $\pi([a, a+\delta]) \approx 1/2$. Now, consider running a typical Metropolis algorithm on $\pi_u$. Unless $X_0 \in [a, a+\delta]$, or the sampler gets "lucky" and achieves $X_n \in [a, a+\delta]$ for some moderate $n$, then the algorithm will likely miss the tiny interval $[a, a+\delta]$ entirely, over any feasible time period. The algorithm will thus "appear" (to the naked eye or to any statistical test) to converge to the Uniform($\mathcal{X}$) distribution, even though Uniform($\mathcal{X}$) is very different from $\pi(\cdot)$. Nevertheless, this algorithm is still geometrically ergodic (in fact uniformly ergodic). So in this example, geometric ergodicity does not guarantee a well-behaved sampler.

Example 7. Let $\mathcal{X} = \mathbf{R}$, and let $\pi_u(x) = 1/(1 + x^2)$ be the (unnormalised) density of the Cauchy distribution. Then a random-walk Metropolis algorithm for $\pi_u$ (with, say, $X_0 = 0$ and $Q(x,\cdot) = \text{Uniform}[x-1, x+1]$) is ergodic but is not geometrically ergodic. And, indeed, this sampler has very slow, poor convergence properties. On the other hand, let $\pi'_u(x) = \pi_u(x)\, \mathbf{1}_{|x| \le 10^{100}}$, i.e. $\pi'_u$ corresponds to $\pi_u$ truncated at $\pm$ one googol. Then the same random-walk Metropolis algorithm for $\pi'_u$ is geometrically ergodic, in fact uniformly ergodic. However, the two algorithms are indistinguishable when run for any remotely feasible number of iterations. Thus, this is an example where geometric ergodicity does not in any way indicate improved performance of the algorithm.

In addition to the above two examples, there are also numerous examples 
of important Markov chains on finite state spaces (such as the single-site Gibbs 
sampler for the Ising model at low temperature on a large but finite grid) which 
are irreducible and aperiodic, and hence uniformly (and thus also geometrically) 
ergodic, but which converge to stationarity extremely slowly. 

The above examples illustrate a limitation of qualitative convergence prop- 
erties such as geometric ergodicity. It is thus desirable where possible to instead 
obtain quantitative bounds on Markov chain convergence. We consider this issue 
next.

3.5. Quantitative Convergence Rates

In light of the above, we ideally want quantitative bounds on convergence rates, i.e. bounds of the form $\|P^n(x,\cdot) - \pi(\cdot)\| \le g(x,n)$ for some explicit function $g(x,n)$, which (hopefully) is small for large $n$. Such questions now have a substantial history in MCMC, see e.g. [SS] , [50], , 0S] , [2D], (77], 0^ , [l^ , P3], [85], [28], 0, [86], [87].

We here present a result from [85], which follows as a special case of [24]; it is based on the approach of [80] while also taking into account a small improvement from [77].

Our result requires a bivariate drift condition of the form

$$\bar{P}h(x,y) \le h(x,y)/\alpha, \qquad (x,y) \notin C \times C \tag{11}$$

for some function $h : \mathcal{X} \times \mathcal{X} \to [1,\infty)$ and some $\alpha > 1$, where

$$\bar{P}h(x,y) \equiv \int_{\mathcal{X}} \int_{\mathcal{X}} h(z,w)\, P(x,dz)\, P(y,dw).$$

(Thus, $\bar{P}$ represents running two independent copies of the chain.) Of course, (11) is closely related to (10); for example we have the following (see also [80], and Proposition 2 of [57]):

Proposition 11. Suppose the univariate drift condition (10) is satisfied for some $V : \mathcal{X} \to [1,\infty]$, $C \subseteq \mathcal{X}$, $\lambda < 1$, and $b < \infty$. Let $d = \inf_{x \in C^c} V(x)$. Then if $d > [b/(1-\lambda)] - 1$, then the bivariate drift condition (11) is satisfied for the same $C$, with $h(x,y) = \frac{1}{2}[V(x) + V(y)]$ and $\alpha^{-1} = \lambda + b/(d+1) < 1$.

Proof. If $(x,y) \notin C \times C$, then either $x \notin C$ or $y \notin C$ (or both), so $h(x,y) \ge (1+d)/2$, and $PV(x) + PV(y) \le \lambda V(x) + \lambda V(y) + b$. Then

$$\bar{P}h(x,y) = \tfrac{1}{2}[PV(x) + PV(y)] \le \tfrac{1}{2}[\lambda V(x) + \lambda V(y) + b]$$

$$= \lambda\, h(x,y) + b/2 \le \lambda\, h(x,y) + (b/2)\big[h(x,y)/((1+d)/2)\big] = [\lambda + b/(1+d)]\, h(x,y).$$

Furthermore, $d > [b/(1-\lambda)] - 1$ implies that $\lambda + b/(1+d) < 1$. □
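
The conversion in Proposition 11 from univariate to bivariate drift constants is a one-line computation. The following sketch (function name hypothetical) returns $\alpha$ from $\lambda$, $b$ and $d$, and refuses to proceed when the hypothesis $d > b/(1-\lambda) - 1$ fails.

```python
def bivariate_drift_alpha(lam, b, d):
    """Return alpha > 1 with 1/alpha = lam + b / (d + 1), as in Proposition 11.

    lam, b : constants of the univariate drift condition (10), 0 < lam < 1.
    d      : infimum of V over the complement of C.
    Raises ValueError if the hypothesis d > b/(1 - lam) - 1 does not hold.
    """
    if not d > b / (1.0 - lam) - 1.0:
        raise ValueError("need d > b/(1 - lam) - 1 for Proposition 11")
    return 1.0 / (lam + b / (d + 1.0))


if __name__ == "__main__":
    # Illustrative constants: lam = 0.5, b = 1, d = 3 gives 1/alpha = 0.75.
    print(bivariate_drift_alpha(0.5, 1.0, 3.0))
```

When the hypothesis fails, the usual remedy is to enlarge $C$ so that $V$ is larger off $C$, at the cost of a possibly weaker minorisation constant.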



Finally, we let

$$B_{n_0} = \max\Big[1,\ \alpha^{n_0} (1-\epsilon) \sup_{C \times C} \bar{R}h\Big], \tag{12}$$

where for $(x,y) \in C \times C$,

$$\bar{R}h(x,y) = \int_{\mathcal{X}} \int_{\mathcal{X}} (1-\epsilon)^{-2}\, h(z,w)\, \big(P^{n_0}(x,dz) - \epsilon\,\nu(dz)\big)\, \big(P^{n_0}(y,dw) - \epsilon\,\nu(dw)\big).$$

In terms of these assumptions, we state our result as follows.

Theorem 12. Consider a Markov chain on a state space $\mathcal{X}$, having transition kernel $P$. Suppose there is $C \subseteq \mathcal{X}$, $h : \mathcal{X} \times \mathcal{X} \to [1,\infty)$, a probability distribution $\nu(\cdot)$ on $\mathcal{X}$, $\alpha > 1$, $n_0 \in \mathbf{N}$, and $\epsilon > 0$, such that (8) and (11) hold. Define $B_{n_0}$ by (12). Then for any joint initial distribution $\mathcal{L}(X_0, X'_0)$, and any integers $1 \le j \le k$, if $\{X_n\}$ and $\{X'_n\}$ are two copies of the Markov chain started in the joint initial distribution $\mathcal{L}(X_0, X'_0)$, then

$$\|\mathcal{L}(X_k) - \mathcal{L}(X'_k)\|_{TV} \le (1-\epsilon)^j + \alpha^{-k} (B_{n_0})^{j-1}\, E[h(X_0, X'_0)].$$

In particular, by choosing $j = \lfloor rk \rfloor$ for sufficiently small $r > 0$, we obtain an explicit, quantitative convergence bound which goes to 0 exponentially quickly as $k \to \infty$.
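
Once the constants in Theorem 12 are available, evaluating the bound and choosing $j$ is mechanical. The sketch below simply minimises the right-hand side over $1 \le j \le k$ by direct search. The function names are our own, and the constants passed in are assumed to have been derived separately for the chain at hand. For very large $k$ the powers may overflow or underflow in floating point, which this sketch does not guard against.

```python
def theorem12_bound(k, j, eps, alpha, B, Eh0):
    """Right-hand side of Theorem 12 for given k and j.

    eps, alpha, B : minorisation constant, bivariate drift constant, B_{n0}.
    Eh0           : E[h(X_0, X'_0)] under the joint initial distribution.
    """
    return (1.0 - eps) ** j + alpha ** (-k) * B ** (j - 1) * Eh0


def best_bound(k, eps, alpha, B, Eh0):
    """Minimise the bound over 1 <= j <= k; returns (bound, j)."""
    return min((theorem12_bound(k, j, eps, alpha, B, Eh0), j)
               for j in range(1, k + 1))


if __name__ == "__main__":
    # Purely illustrative constants, not taken from any particular chain.
    for k in (50, 100, 200):
        print(k, best_bound(k, eps=0.1, alpha=1.05, B=1.5, Eh0=10.0))
```

The first term favours large $j$ and the second favours small $j$, which is why the theorem suggests taking $j$ proportional to $k$ with a small constant of proportionality.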

Theorem 12 is proved in Section 4. Versions of this theorem have been applied to various realistic MCMC algorithms, including for versions of the variance components model described earlier, resulting in bounds like $\|P^n(x,\cdot) - \pi(\cdot)\| < 0.01$ for $n = 140$ or $n = 3415$; see e.g. [82], and Jones and Hobert [45]. Thus, while it is admittedly hard work to apply Theorem 12 to realistic MCMC algorithms, it is indeed possible and often can establish rigorously that perfectly feasible numbers of iterations are sufficient to ensure convergence.

Remark. For complicated Markov chains, it might be difficult to apply Theorem 12 successfully. In such cases, MCMC practitioners instead use "convergence diagnostics", i.e. do statistical analysis of the realised output $X_1, X_2, \ldots$, to see if the distributions of $X_n$ appear to be "stable" for large enough $n$. Many such diagnostics involve running the Markov chain repeatedly from different initial states, and checking if the chains all converge to approximately the same distribution (see e.g. Gelman and Rubin [31], and Cowles and Carlin [IB]). This technique often works well in practice. However, it provides no rigorous guarantees and can sometimes be fooled into prematurely claiming convergence (see e.g. [51]), as is likely to happen for the examples at the end of Section 3.4. Furthermore, convergence diagnostics can also introduce bias into the resulting estimates (see [E]). Overall, despite the extensive theory surveyed herein, the "convergence time problem" remains largely unresolved for practical application of MCMC. (This is also the motivation for "perfect MCMC" algorithms, originally developed by Propp and Wilson [63] and not discussed here; for further discussion see e.g. Kendall and Møller [36], Thönnes [92], and Fill et al. [27].)

4. Convergence Proofs using Coupling Constructions 

In this section, we prove some of the theorems stated earlier. There are of course many methods available for bounding convergence of Markov chains, appropriate to various settings (see e.g. p], [21], [88], [2], [90], and Subsection 5.4 herein), including the setting of large but finite state spaces that often arises in computer science (see e.g. Sinclair [5S] and Randall [M]) but is not our emphasis here. In this section, we focus on the method of coupling, which seems particularly well-suited to analysing MCMC algorithms on general (uncountable) state spaces. It is also particularly well-suited to incorporating small sets (though small sets can also be combined with regeneration theory, see e.g. [8], @], [57], [35]). Some of the proofs below are new, and avoid many of the long analytic arguments of some previous proofs (e.g. Nummelin [60], and Meyn and Tweedie [54]).

4.1. The Coupling Inequality

The basic idea of coupling is the following. Suppose we have two random variables $X$ and $Y$, defined jointly on some space $\mathcal{X}$. If we write $\mathcal{L}(X)$ and $\mathcal{L}(Y)$ for their respective probability distributions, then we can write

$$\|\mathcal{L}(X) - \mathcal{L}(Y)\| = \sup_A |P(X \in A) - P(Y \in A)|$$

$$= \sup_A |P(X \in A, X = Y) + P(X \in A, X \ne Y) - P(Y \in A, Y = X) - P(Y \in A, Y \ne X)|$$

$$= \sup_A |P(X \in A, X \ne Y) - P(Y \in A, Y \ne X)|$$

$$\le P(X \ne Y),$$

so that

$$\|\mathcal{L}(X) - \mathcal{L}(Y)\| \le P(X \ne Y). \tag{13}$$

That is, the variation distance between the laws of two random variables is bounded by the probability that they are unequal. For background, see e.g. Pitman [5J, Lindvall gSJ, and Thorisson [5T].
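
The coupling inequality can be verified directly for random variables on a finite set. In the sketch below, a joint distribution is stored as a matrix `joint[i][j]` $= P(X = i, Y = j)$; all function names are our own. We also include the standard "maximal coupling" construction, which is not discussed in the text above but shows that the bound (13) can be attained with equality.

```python
def tv_distance(mu, nu):
    """Total variation distance sup_A |mu(A) - nu(A)| on a finite set."""
    return 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))


def marginals(joint):
    """Marginal laws of X (rows) and Y (columns) of a joint pmf matrix."""
    return [sum(row) for row in joint], [sum(col) for col in zip(*joint)]


def prob_unequal(joint):
    """P(X != Y) under the joint pmf."""
    return 1.0 - sum(joint[i][i] for i in range(len(joint)))


def independent_coupling(mu, nu):
    """Joint pmf under which X ~ mu and Y ~ nu are independent."""
    return [[m * v for v in nu] for m in mu]


def maximal_coupling(mu, nu):
    """Joint pmf with marginals mu, nu and P(X != Y) equal to the TV distance."""
    n = len(mu)
    overlap = [min(m, v) for m, v in zip(mu, nu)]
    tv = 1.0 - sum(overlap)
    joint = [[0.0] * n for _ in range(n)]
    for i in range(n):
        joint[i][i] = overlap[i]          # X = Y on the common part
    if tv > 0.0:
        for i in range(n):
            for j in range(n):            # leftover mass, spread independently
                joint[i][j] += (mu[i] - overlap[i]) * (nu[j] - overlap[j]) / tv
    return joint


if __name__ == "__main__":
    mu, nu = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
    print(tv_distance(mu, nu))
    print(prob_unequal(independent_coupling(mu, nu)))
    print(prob_unequal(maximal_coupling(mu, nu)))
```

The independent coupling gives a valid but loose bound, which is why the coupling constructions used for Markov chains try to make the two copies agree as often as possible.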

4.2. Small Sets and Coupling

Suppose now that $C$ is a small set. We shall use the following coupling construction, which is essentially the "splitting technique" of Nummelin [59] and Athreya and Ney [8]; see also Nummelin [60], and Meyn and Tweedie [54]. The idea is to run two copies $\{X_n\}$ and $\{X'_n\}$ of the Markov chain, each of which marginally follows the updating rules $P(x,\cdot)$, but whose joint construction (using $C$) gives them as high a probability as possible of becoming equal to each other.

THE COUPLING CONSTRUCTION:

Start with $X_0 = x$ and $X'_0 \sim \pi(\cdot)$, and $n = 0$, and repeat the following loop forever.

Beginning of Loop. Given $X_n$ and $X'_n$:

1. If $X_n = X'_n$, choose $X_{n+1} = X'_{n+1} \sim P(X_n,\cdot)$, and replace $n$ by $n+1$.

2. Else, if $(X_n, X'_n) \in C \times C$, then:

(a) w.p. $\epsilon$, choose $X_{n+n_0} = X'_{n+n_0} \sim \nu(\cdot)$;

(b) else, w.p. $1-\epsilon$, conditionally independently choose

$$X_{n+n_0} \sim \frac{1}{1-\epsilon}\big[P^{n_0}(X_n,\cdot) - \epsilon\,\nu(\cdot)\big],$$

$$X'_{n+n_0} \sim \frac{1}{1-\epsilon}\big[P^{n_0}(X'_n,\cdot) - \epsilon\,\nu(\cdot)\big].$$

In the case $n_0 > 1$, for completeness go back and construct $X_{n+1}, \ldots, X_{n+n_0-1}$ from their correct conditional distributions given $X_n$ and $X_{n+n_0}$, and similarly (and conditionally independently) construct $X'_{n+1}, \ldots, X'_{n+n_0-1}$ from their correct conditional distributions given $X'_n$ and $X'_{n+n_0}$. In any case, replace $n$ by $n + n_0$.

3. Else, conditionally independently choose $X_{n+1} \sim P(X_n,\cdot)$ and $X'_{n+1} \sim P(X'_n,\cdot)$, and replace $n$ by $n+1$.

Then return to Beginning of Loop.
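
The coupling construction can be sketched in code for a finite state space with $n_0 = 1$. This is an illustration under our own assumptions rather than a general implementation: states are integers, the kernel is a list of rows, and the example kernel and its minorisation constants (`P_EX`, `C_EX`, `EPS_EX`, `NU_EX`) are hypothetical. For simplicity the second chain is started from a fixed state rather than from $\pi(\cdot)$.

```python
import random


def sample(weights, rng):
    """Draw an index with probability proportional to the given weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def coupled_step(P, C, eps, nu, x, xp, rng):
    """One pass through the loop of the coupling construction (n0 = 1)."""
    if x == xp:                                   # step 1: move together
        y = sample(P[x], rng)
        return y, y
    if x in C and xp in C:                        # step 2: try to couple
        if rng.random() < eps:                    # (a) both drawn from nu
            y = sample(nu, rng)
            return y, y
        # (b) independent draws from the residual kernels; the weights
        # are unnormalised (they sum to 1 - eps) and clipped at zero to
        # absorb floating-point rounding.
        res_x = [max(0.0, P[x][y] - eps * nu[y]) for y in range(len(nu))]
        res_xp = [max(0.0, P[xp][y] - eps * nu[y]) for y in range(len(nu))]
        return sample(res_x, rng), sample(res_xp, rng)
    return sample(P[x], rng), sample(P[xp], rng)  # step 3: independent moves


def coupling_time(P, C, eps, nu, x0, xp0, rng, max_steps=10_000):
    """First n with X_n = X'_n, or None if not reached within max_steps."""
    x, xp = x0, xp0
    for n in range(max_steps + 1):
        if x == xp:
            return n
        x, xp = coupled_step(P, C, eps, nu, x, xp, rng)
    return None


# Hypothetical example: the whole space is (1, 0.6, nu)-small.
P_EX = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.4, 0.5]]
C_EX = {0, 1, 2}
EPS_EX = 0.6
NU_EX = [1.0 / 6.0, 1.0 / 2.0, 1.0 / 3.0]

if __name__ == "__main__":
    rng = random.Random(0)
    times = [coupling_time(P_EX, C_EX, EPS_EX, NU_EX, 0, 2, rng)
             for _ in range(10)]
    print(times)
```

Each chain, viewed on its own, moves according to the original kernel, because mixing the $\nu$ draw (weight $\epsilon$) with the residual draw (weight $1-\epsilon$) recovers $P(x,\cdot)$. Once the two copies meet, step 1 keeps them equal forever.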



Under this construction, it is easily checked that $X_n$ and $X'_n$ are each marginally updated according to the correct transition kernel $P$. It follows that $P[X_n \in A] = P^n(x, A)$ and $P[X'_n \in A] = \pi(A)$ for all $n$. Moreover, the two chains are run independently until they both enter $C$, at which time the minorisation splitting construction (step 2) is utilised. Without such a construction, on uncountable state spaces, we would not be able to ensure successful coupling of the two processes.

The coupling inequality then says that $\|P^n(x,\cdot) - \pi(\cdot)\| \le P[X_n \ne X'_n]$. The question is, can we use this to obtain useful bounds on $\|P^n(x,\cdot) - \pi(\cdot)\|$? In fact, we shall now provide proofs (nearly self-contained) of all of the theorems stated earlier, in terms of this coupling construction. This allows for intuitive understanding of the theorems, while also avoiding various analytic technicalities of the previous proofs of some of these theorems.

4.3. Proof of Theorem 8

In this case, $C = \mathcal{X}$, so every $n_0$ iterations we have probability at least $\epsilon$ of making $X_n$ and $X'_n$ equal. It follows that if $n = n_0 m$, then $P[X_n \ne X'_n] \le (1-\epsilon)^m$. Hence, from the coupling inequality, $\|P^n(x,\cdot) - \pi(\cdot)\| \le (1-\epsilon)^m = (1-\epsilon)^{n/n_0}$ in this case. It then follows from Proposition 3(c) that $\|P^n(x,\cdot) - \pi(\cdot)\| \le (1-\epsilon)^{\lfloor n/n_0 \rfloor}$ for any $n$. □

4.4. Proof of Theorem 12

We follow the general outline of [85]. We again begin by assuming that $n_0 = 1$ in the minorisation condition for the small set $C$ (and thus write $B_{n_0}$ as $B$), and indicate at the end what changes are required if $n_0 > 1$.

Let

$$N_k = \#\{m : 0 \le m \le k,\ (X_m, X'_m) \in C \times C\},$$

and let $\tau_1, \tau_2, \ldots$ be the times of the successive visits of $\{(X_n, X'_n)\}$ to $C \times C$. Then for any integer $j$ with $1 \le j \le k$,

$$P[X_k \ne X'_k] = P[X_k \ne X'_k,\ N_{k-1} \ge j] + P[X_k \ne X'_k,\ N_{k-1} < j]. \tag{14}$$

Now, the event $\{X_k \ne X'_k,\ N_{k-1} \ge j\}$ is contained in the event that the first $j$ coin flips all came up tails. Hence, $P[X_k \ne X'_k,\ N_{k-1} \ge j] \le (1-\epsilon)^j$, which bounds the first term in (14).

To bound the second term in (14), let

$$M_k = \alpha^k B^{-N_{k-1}}\, h(X_k, X'_k)\, \mathbf{1}(X_k \ne X'_k), \qquad k = 0, 1, 2, \ldots$$

(where $N_{-1} = 0$).

Lemma 13. We have

$$E[M_{k+1} \mid X_0, \ldots, X_k, X'_0, \ldots, X'_k] \le M_k,$$

i.e. $\{M_k\}$ is a supermartingale.

Proof. If $(X_k, X'_k) \notin C \times C$, then $N_k = N_{k-1}$, so

$$E[M_{k+1} \mid X_0, \ldots, X_k, X'_0, \ldots, X'_k]$$

$$= \alpha^{k+1} B^{-N_{k-1}}\, E[h(X_{k+1}, X'_{k+1})\, \mathbf{1}(X_{k+1} \ne X'_{k+1}) \mid X_k, X'_k] \quad \text{(since our coupling construction is Markovian)}$$

$$\le \alpha^{k+1} B^{-N_{k-1}}\, E[h(X_{k+1}, X'_{k+1}) \mid X_k, X'_k]\, \mathbf{1}(X_k \ne X'_k)$$

$$= M_k\, \alpha\, E[h(X_{k+1}, X'_{k+1}) \mid X_k, X'_k] / h(X_k, X'_k)$$

$$\le M_k,$$

by (11). Similarly, if $(X_k, X'_k) \in C \times C$, then $N_k = N_{k-1} + 1$, so assuming $X_k \ne X'_k$ (since if $X_k = X'_k$, then the result is trivial), we have

$$E[M_{k+1} \mid X_0, \ldots, X_k, X'_0, \ldots, X'_k]$$

$$= \alpha^{k+1} B^{-N_{k-1}-1}\, E[h(X_{k+1}, X'_{k+1})\, \mathbf{1}(X_{k+1} \ne X'_{k+1}) \mid X_k, X'_k]$$

$$= \alpha^{k+1} B^{-N_{k-1}-1} (1-\epsilon)(\bar{R}h)(X_k, X'_k)$$

$$= M_k\, \alpha B^{-1} (1-\epsilon)(\bar{R}h)(X_k, X'_k) / h(X_k, X'_k)$$

$$\le M_k,$$

by (12). Hence, $\{M_k\}$ is a supermartingale. □
To proceed, we note that since $B \ge 1$,

$$P[X_k \ne X'_k,\ N_{k-1} < j] = P[X_k \ne X'_k,\ N_{k-1} \le j-1]$$

$$\le P[X_k \ne X'_k,\ B^{-N_{k-1}} \ge B^{-(j-1)}]$$

$$= P[\mathbf{1}(X_k \ne X'_k)\, B^{-N_{k-1}} \ge B^{-(j-1)}]$$

$$\le B^{j-1}\, E[\mathbf{1}(X_k \ne X'_k)\, B^{-N_{k-1}}] \quad \text{(by Markov's inequality)}$$

$$\le B^{j-1}\, E[\mathbf{1}(X_k \ne X'_k)\, B^{-N_{k-1}}\, h(X_k, X'_k)] \quad \text{(since } h \ge 1\text{)}$$

$$= \alpha^{-k} B^{j-1}\, E[M_k] \quad \text{(by defn of } M_k\text{)}$$

$$\le \alpha^{-k} B^{j-1}\, E[M_0] \quad \text{(since } \{M_k\} \text{ is supermartingale)}$$

$$\le \alpha^{-k} B^{j-1}\, E[h(X_0, X'_0)] \quad \text{(by defn of } M_0\text{, since the indicator is at most 1)}.$$

Theorem 12 now follows (in the case $n_0 = 1$), by combining these two bounds with (14) and (13).

Finally, we consider the changes required if $n_0 > 1$. In this case, the main change is that we do not wish to count visits to $C \times C$ during which the joint chain could not try to couple, i.e. visits which correspond to the "filling in" times for going back and constructing $X_{n+1}, \ldots, X_{n+n_0-1}$ [and similarly for $X'$] in step 2 of the coupling construction. Thus, we instead let $N_k$ count the number of visits to $C \times C$, and $\{\tau_i\}$ the actual visit times, avoiding all such "filling in" times. Also, we replace $N_{k-1}$ by $N_{k-n_0}$ in (14) and in the definition of $M_k$. Finally, what is a supermartingale is not $\{M_k\}$ but rather $\{M_{t(k)}\}$, where $t(k)$ is the latest time $\le k$ which does not correspond to a "filling in" time. (Thus, $t(k)$ will take the value $k$, unless the joint chain visited $C \times C$ at some time between $k - n_0$ and $k - 1$.) With these changes, the proof goes through just as before. □



4.5. Proof of Theorem 9

Here we give a direct coupling proof of Theorem 9, thereby somewhat avoiding the technicalities of e.g. Meyn and Tweedie [54] (though admittedly with a slightly weaker conclusion; see Fact 10). Our approach shall be to make use of Theorem 12. To begin, set $h(x,y) = \frac{1}{2}[V(x) + V(y)]$. Our proof will use the following technical result.

Lemma 14. We may assume without loss of generality that

$$\sup_{x \in C} V(x) < \infty. \tag{15}$$

Specifically, given a small set $C$ and drift function $V$ satisfying (8) and (10), we can find a small set $C_0 \subseteq C$ such that (8) and (10) still hold (with the same $n_0$ and $\epsilon$ and $b$, but with $\lambda$ replaced by some $\lambda_0 < 1$), and such that (15) also holds.

Proof. Let $\lambda$ and $b$ be as in (10). Choose $\delta$ with $0 < \delta < 1 - \lambda$, let $\lambda_0 = 1 - \delta$, let $K = b/(1 - \lambda - \delta)$, and set

$$C_0 = C \cap \{x \in \mathcal{X} : V(x) \le K\}.$$

Then clearly (8) continues to hold on $C_0$, since $C_0 \subseteq C$. It remains to verify that (10) holds with $C$ replaced by $C_0$, and $\lambda$ replaced by $\lambda_0$. Now, (10) clearly holds for $x \in C_0$ and $x \notin C$, by inspection. Finally, for $x \in C \setminus C_0$, we have $V(x) \ge K$, and so using the original drift condition (10), we have

$$(PV)(x) \le \lambda V(x) + b\, \mathbf{1}_C(x) = (1-\delta)V(x) - (1 - \lambda - \delta)V(x) + b$$

$$\le (1-\delta)V(x) - (1 - \lambda - \delta)K + b = (1-\delta)V(x) = \lambda_0 V(x),$$

showing that (10) still holds, with $C$ replaced by $C_0$ and $\lambda$ replaced by $\lambda_0$. □

As an aside, we note that in Lemma 14, it may not be possible to satisfy (15) by instead modifying $V$ and leaving $C$ unchanged:

Proposition 15. There exists a geometrically ergodic Markov chain, with small 
set C and drift function V satisfying (|8|) and (|10j) , such that there does not exist 
a drift function Vq : X — > [0, oo] with the property that upon replacing V by 
Vo, (O and (|10j) continue to hold, and (|15|) also holds. 

Proof. Consider the Markov chain on $\mathcal{X} = (0, \infty)$, defined as follows. For $x \ge 2$, $P(x, \cdot) = \delta_{x-1}(\cdot)$, a point-mass at $x - 1$. For $1 \le x < 2$, $P(x, \cdot)$ is uniform on $[1/2, 1]$. For $0 < x < 1$, $P(x, \cdot) = \frac{1}{2}\lambda(\cdot) + \frac{1}{2}\delta_{h(x)}(\cdot)$, where $\lambda$ is Lebesgue measure on $(0, 1)$, and $h(x) = 1 + \sqrt{\log(1/x)}$.

For this chain, the interval $C = (0, 1)$ is clearly $(1, 1/2, \lambda)$-small. Furthermore, since $\int_0^1 \sqrt{\log(1/x)}\, dx = \sqrt{\pi}/2 < \infty$, the return times to $C$ have finite mean, so the chain has a stationary distribution by standard renewal theory arguments (see e.g. Asmussen [4]). In addition, with drift function $V(x) = \max(e^x, x^{-1/2})$, we can compute $(PV)(x)$ explicitly, and verify directly that (say) $PV(x) \le 0.8\, V(x) + 4\, \mathbf{1}_C(x)$ for all $x \in \mathcal{X}$, thus verifying (10) with $\lambda = 0.8$ and $b = 4$. Hence, by Theorem 9, the chain is geometrically ergodic.
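For completeness, the value of the integral quoted above can be recovered by the substitution $x = e^{-t}$, which turns it into a Gamma integral. The link to return times is only heuristic here: after a jump from $x \in C$ to $h(x)$, the chain steps down by one unit at a time, so the excursion length is of the same order as $h(x) = 1 + \sqrt{\log(1/x)}$.

```latex
\[
\int_0^1 \sqrt{\log(1/x)}\,dx
  = \int_0^\infty t^{1/2} e^{-t}\,dt
  = \Gamma(3/2)
  = \tfrac{1}{2}\,\Gamma(1/2)
  = \frac{\sqrt{\pi}}{2}
  \approx 0.886 .
\]
```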

On the other hand, suppose we had some drift function $V_0$ satisfying (10), such that $\sup_{x \in C} V_0(x) < \infty$. Then since $PV_0(x) = \frac{1}{2} V_0(h(x)) + \frac{1}{2} \int_0^1 V_0(y)\, dy$ for $x \in C$, this would imply that $\sup_{x \in C} V_0(h(x)) < \infty$, i.e. that $V_0(h(x))$ is bounded for all $0 < x < 1$, which would in turn imply that $V_0$ were bounded everywhere on $\mathcal{X}$. But then Fact 10 would imply that the chain is uniformly ergodic, which it clearly is not. This gives a contradiction. □

Thus, for the remainder of this proof, we can (and do) assume that (15) holds. This, together with (10), implies that

\[ \sup_{(x,y) \in C \times C} Rh(x,y) < \infty, \tag{16} \]

which in turn ensures that the quantity $B_{n_0}$ of (12) is finite.

To continue, let $d = \inf_{C^c} V$. Then we see from Proposition 11 that the bivariate drift condition (11) will hold, provided that $d > b/(1 - \lambda) - 1$. In that case, Theorem 9 follows immediately (in fact, in a quantitative version) by combining Proposition 11 with Theorem 12.

However, if $d \le b/(1 - \lambda) - 1$, then this argument does not go through. This is not merely a technicality; the condition $d > b/(1 - \lambda) - 1$ ensures that the chain is aperiodic, and without this condition we must somehow use the aperiodicity assumption more directly in the proof.

Our plan shall be to enlarge $C$ so that the new value of $d$ satisfies $d > b/(1 - \lambda) - 1$, and to use aperiodicity to show that $C$ remains a small set (i.e., that (8) still holds though perhaps for uncontrollably larger $n_0$ and smaller $\epsilon > 0$). Theorem 9 will then follow from Proposition 11 and Theorem 12 as above. (Note that we will have no direct control over the new values of $n_0$ and $\epsilon$, which is why this approach does not provide a quantitative convergence rate bound.)

To proceed, choose any $d' > b/(1 - \lambda) - 1$, let $S = \{x \in \mathcal{X};\ V(x) \le d'\}$, and set $C' = C \cup S$. This ensures that $\inf_{x \in C'^c} V(x) \ge d' > b/(1 - \lambda) - 1$. Furthermore, since $V$ is bounded on $S$ by construction, we see that (15) will still hold with $C$ replaced by $C'$. It then follows from (16) and (10) that we will still have $B_{n_0} < \infty$ even upon replacing $C$ by $C'$. Thus, Theorem 9 will follow from Proposition 11 and Theorem 12 if we can prove:

Lemma 16. $C'$ is a small set.

To prove Lemma 16, we use the notion of "petite set", following [54].

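Definition. A subset $C \subseteq \mathcal{X}$ is petite (or, $(n_0, \epsilon, \nu)$-petite), relative to a small set C, if there exists a positive integer $n_0$, $\epsilon > 0$, and a probability measure $\nu(\cdot)$ on $\mathcal{X}$ such that

\[ \sum_{i=1}^{n_0} P^i(x, \cdot) \ge \epsilon\, \nu(\cdot), \qquad x \in C. \tag{17} \]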

Intuitively, the definition of petite set is like that of small set, except that it allows the different states in $C$ to cover the minorisation measure $\epsilon\nu(\cdot)$ at different times $i$. Obviously, any small set is petite. The converse is false in general, as the petite set condition does not itself rule out periodic behaviour of the chain (for example, perhaps some of the states $x \in C$ cover $\epsilon\nu(\cdot)$ only at odd times, and others only at even times). However, for an aperiodic, $\phi$-irreducible Markov chain, we have the following result, whose proof is presented in the Appendix.

Lemma 17. (Meyn and Tweedie [54], Theorem 5.5.7) For an aperiodic, $\phi$-irreducible Markov chain, all petite sets are small sets.

To make use of Lemma 17, we use the following.

Lemma 18. Let $C' = C \cup S$ where $S = \{x \in \mathcal{X};\ V(x) \le d\}$ for some $d < \infty$, as above. Then $C'$ is petite.

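Proof. To begin, choose $N$ large enough that $r \equiv 1 - \lambda^N d > 0$. Let $\tau_C = \inf\{n \ge 1;\ X_n \in C\}$ be the first return time to $C$. Let $Z_n = \lambda^{-n} V(X_n)$, and let $W_n = Z_{\min(n, \tau_C)}$. Then the drift condition (10) implies that $W_n$ is a supermartingale. Indeed, if $\tau_C \le n$, then

\[ \mathbf{E}[W_{n+1} \mid X_0, X_1, \ldots, X_n] = \mathbf{E}[Z_{\tau_C} \mid X_0, X_1, \ldots, X_n] = Z_{\tau_C} = W_n, \]

while if $\tau_C > n$, then $X_n \notin C$, so using (10),

\[
\begin{aligned}
\mathbf{E}[W_{n+1} \mid X_0, X_1, \ldots, X_n] &= \lambda^{-(n+1)} (PV)(X_n) \\
&\le \lambda^{-(n+1)} \lambda V(X_n) \\
&= \lambda^{-n} V(X_n) \\
&= W_n.
\end{aligned}
\]

Hence, for $x \in S$, using Markov's inequality and the fact that $V \ge 1$,

\[
\begin{aligned}
\mathbf{P}[\tau_C \ge N \mid X_0 = x] &= \mathbf{P}[\lambda^{-\tau_C} \ge \lambda^{-N} \mid X_0 = x] \\
&\le \lambda^N\, \mathbf{E}[\lambda^{-\tau_C} \mid X_0 = x] \le \lambda^N\, \mathbf{E}[Z_{\tau_C} \mid X_0 = x] \\
&\le \lambda^N\, \mathbf{E}[Z_0 \mid X_0 = x] = \lambda^N V(x) \le \lambda^N d,
\end{aligned}
\]

so that $\mathbf{P}[\tau_C < N \mid X_0 = x] \ge r$.

On the other hand, recall that $C$ is $(n_0, \epsilon, \nu(\cdot))$-small, so that $P^{n_0}(x, \cdot) \ge \epsilon\nu(\cdot)$ for $x \in C$. It follows that for $x \in S$, $\sum_{i=1+n_0}^{N+n_0} P^i(x, \cdot) \ge r\epsilon\nu(\cdot)$. Hence, for $x \in S \cup C$, $\sum_{i=n_0}^{N+n_0} P^i(x, \cdot) \ge r\epsilon\nu(\cdot)$. This shows that $S \cup C$ is petite. □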
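The tail bound $\mathbf{P}[\tau_C \ge N \mid X_0 = x] \le \lambda^N V(x)$ obtained in this proof can be checked exactly on a toy example. The sketch below is an illustration under assumptions of our own choosing, not a chain from the text: a nearest-neighbour walk on $\{0, 1, 2, \ldots\}$ that moves down with probability `p_down` and up otherwise, with $C = \{0\}$ and $V(x) = \beta^x$, so that $PV(x) = \lambda V(x)$ for $x \ge 1$ with $\lambda = p/\beta + (1-p)\beta$.

```python
def prob_no_visit(x0, steps, p_down):
    """P[X_1, ..., X_steps all avoid state 0 | X_0 = x0] for the toy walk."""
    dist = {x0: 1.0}  # mass of paths that have not yet hit 0
    for _ in range(steps):
        nxt = {}
        for x, mass in dist.items():
            for y, w in ((x - 1, p_down), (x + 1, 1.0 - p_down)):
                if y >= 1:  # paths reaching 0 have returned to C; drop them
                    nxt[y] = nxt.get(y, 0.0) + mass * w
        dist = nxt
    return sum(dist.values())


# Hypothetical parameters: drift towards 0, and V(x) = beta**x >= 1.
p_down, beta = 0.8, 2.0
lam = p_down / beta + (1.0 - p_down) * beta  # PV(x) = lam * V(x) for x >= 1


def tail_bound_holds(x0, N):
    """Compare the exact tail P[tau_C >= N] with the bound lam**N * V(x0)."""
    # {tau_C >= N} means X_1, ..., X_{N-1} all lie outside C = {0}.
    exact = prob_no_visit(x0, N - 1, p_down)
    bound = lam ** N * beta ** x0
    return exact <= bound + 1e-12
```

For these parameters $\lambda = 0.8$, and the bound is far from tight for small $N$ (it exceeds 1), as expected from an argument based on Markov's inequality.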

Combining Lemmas 18 and 17, we see that $C'$ must be small, proving Lemma 16, and hence proving Theorem 9. □

4.6. Proof of Theorem 4

Theorem 4 does not assume the existence of any small set $C$, so it is not clear how to make use of our coupling construction in this case. However, help is at hand in the form of a remarkable result about the existence of small sets, due to Jain and Jameson [41] (see also Orey [61]). We shall not prove it here; for modern proofs see e.g. [60], p. 16, or [54], Theorem 5.2.2. The key idea (see e.g. Meyn and Tweedie [54], Theorem 5.2.1) is to extract the part of $P^{n_0}(x, \cdot)$ which is absolutely continuous with respect to the measure $\phi$, and then to find a $C$ with $\phi(C) > 0$ such that this density part is at least $\delta > 0$ throughout $C$.

Theorem 19. (Jain and Jameson [41]) Every $\phi$-irreducible Markov chain, on a state space with countably generated $\sigma$-algebra, contains a small set $C \subseteq \mathcal{X}$ with $\phi(C) > 0$. (In fact, each $B \subseteq \mathcal{X}$ with $\phi(B) > 0$ in turn contains a small set $C \subseteq B$ with $\phi(C) > 0$.) Furthermore, the minorisation measure $\nu(\cdot)$ may be taken to satisfy $\nu(C) > 0$.

In terms of our coupling construction, if we can show that the pair $(X_n, X'_n)$ will hit $C \times C$ infinitely often, then they will have infinitely many opportunities to couple, with probability $\ge \epsilon > 0$ of coupling each time. Hence, they will eventually couple with probability 1, thus proving Theorem 4.

We prove this following the outline of [84]. We begin with a lemma about return probabilities:

Lemma 20. Consider a Markov chain on a state space $\mathcal{X}$, having stationary distribution $\pi(\cdot)$. Suppose that for some $A \subseteq \mathcal{X}$, we have $\mathbf{P}_x(\tau_A < \infty) > 0$ for all $x \in \mathcal{X}$. Then for $\pi$-almost-every $x \in \mathcal{X}$, $\mathbf{P}_x(\tau_A < \infty) = 1$.

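Proof. Suppose to the contrary that the conclusion does not hold, i.e. that

\[ \pi\{x \in \mathcal{X} : \mathbf{P}_x(\tau_A = \infty) > 0\} > 0. \tag{18} \]

Then we make the following claims (proved below):

Claim 1. Condition (18) implies that there are constants $\ell, \ell_0 \in \mathbf{N}$, $\delta > 0$, and $B \subseteq \mathcal{X}$ with $\pi(B) > 0$, such that

\[ \mathbf{P}_x\big(\tau_A = \infty,\ \sup\{k \ge 1;\ X_{k\ell_0} \in B\} < \ell\big) \ge \delta, \qquad x \in B. \]

Claim 2. Let $B$, $\ell$, $\ell_0$, and $\delta$ be as in Claim 1. Let $L = \ell\ell_0$, and let $S = \sup\{k \ge 1;\ X_{kL} \in B\}$, using the convention that $S = -\infty$ if the set $\{k \ge 1;\ X_{kL} \in B\}$ is empty. Then for all integers $1 \le r \le j$,

\[ \int_{x \in \mathcal{X}} \pi(dx)\, \mathbf{P}_x[S = r,\ X_{jL} \notin A] \ge \pi(B)\,\delta. \]

Assuming the claims, we complete the proof as follows. We have by stationarity that for any $j \in \mathbf{N}$,

\[
\begin{aligned}
\pi(A^c) &= \int_{x \in \mathcal{X}} \pi(dx)\, P^{jL}(x, A^c) = \int_{x \in \mathcal{X}} \pi(dx)\, \mathbf{P}_x[X_{jL} \notin A] \\
&\ge \sum_{r=1}^{j} \int_{x \in \mathcal{X}} \pi(dx)\, \mathbf{P}_x[S = r,\ X_{jL} \notin A] \ge \sum_{r=1}^{j} \pi(B)\,\delta = j\,\pi(B)\,\delta.
\end{aligned}
\]

For $j > 1/\pi(B)\delta$, this gives $\pi(A^c) > 1$, which is impossible. This gives a contradiction, and hence completes the proof of Lemma 20, subject to the proofs of Claims 1 and 2 below. □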

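Proof of Claim 1. By (18), we can find $\delta_1$ and a subset $B_1 \subseteq \mathcal{X}$ with $\pi(B_1) > 0$, such that $\mathbf{P}_x(\tau_A < \infty) \le 1 - \delta_1$ for all $x \in B_1$. On the other hand, since $\mathbf{P}_x(\tau_A < \infty) > 0$ for all $x \in \mathcal{X}$, we can find $\ell_0 \in \mathbf{N}$ and $\delta_2 > 0$ and $B_2 \subseteq B_1$ with $\pi(B_2) > 0$ and with $P^{\ell_0}(x, A) \ge \delta_2$ for all $x \in B_2$.

Set $\eta = \#\{k \ge 1;\ X_{k\ell_0} \in B_2\}$. Then for any $r \in \mathbf{N}$ and $x \in \mathcal{X}$, we have $\mathbf{P}_x(\tau_A = \infty,\ \eta = r) \le (1 - \delta_2)^r$. In particular, $\mathbf{P}_x(\tau_A = \infty,\ \eta = \infty) = 0$. Hence, for $x \in B_2$, we have

\[
\begin{aligned}
\mathbf{P}_x(\tau_A = \infty,\ \eta < \infty) &= 1 - \mathbf{P}_x(\tau_A = \infty,\ \eta = \infty) - \mathbf{P}_x(\tau_A < \infty) \\
&\ge 1 - 0 - (1 - \delta_1) = \delta_1.
\end{aligned}
\]

Hence, there is $\ell \in \mathbf{N}$, $\delta > 0$, and $B \subseteq B_2$ with $\pi(B) > 0$, such that

\[ \mathbf{P}_x\big(\tau_A = \infty,\ \sup\{k \ge 1;\ X_{k\ell_0} \in B_2\} < \ell\big) \ge \delta, \qquad x \in B. \]

Finally, since $B \subseteq B_2$, we have $\sup\{k \ge 1;\ X_{k\ell_0} \in B_2\} \ge \sup\{k \ge 1;\ X_{k\ell_0} \in B\}$, thus establishing the claim. □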

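Proof of Claim 2. We compute using stationarity and then Claim 1 that

\[
\begin{aligned}
\int_{x \in \mathcal{X}} \pi(dx)\, \mathbf{P}_x[S = r,\ X_{jL} \notin A]
&= \int_{x \in \mathcal{X}} \pi(dx) \int_{y \in B} P^{rL}(x, dy)\, \mathbf{P}_y[S = -\infty,\ X_{(j-r)L} \notin A] \\
&= \int_{y \in B} \int_{x \in \mathcal{X}} \pi(dx)\, P^{rL}(x, dy)\, \mathbf{P}_y[S = -\infty,\ X_{(j-r)L} \notin A] \\
&= \int_{y \in B} \pi(dy)\, \mathbf{P}_y[S = -\infty,\ X_{(j-r)L} \notin A] \\
&\ge \int_{y \in B} \pi(dy)\, \delta = \pi(B)\,\delta. \qquad \Box
\end{aligned}
\]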

To proceed, we let $C$ be a small set as in Theorem 19. Consider again the coupling construction $\{(X_n, Y_n)\}$. Let $G \subseteq \mathcal{X} \times \mathcal{X}$ be the set of $(x, y)$ for which $\mathbf{P}_{(x,y)}(\exists\, n \ge 1;\ X_n = Y_n) = 1$. From the coupling construction, we see that if $(X_0, X'_0) \equiv (x, X'_0) \in G$, then $\lim_{n \to \infty} \mathbf{P}[X_n = X'_n] = 1$, so that $\lim_{n \to \infty} \|P^n(x, \cdot) - \pi(\cdot)\| = 0$, proving Theorem 4. Hence, it suffices to show that for $\pi$-a.e. $x \in \mathcal{X}$, we have $\mathbf{P}[(x, X'_0) \in G] = 1$.

Let $G$ be as above, let $G_x = \{y \in \mathcal{X};\ (x, y) \in G\}$ for $x \in \mathcal{X}$, and let $\overline{G} = \{x \in \mathcal{X};\ \pi(G_x) = 1\}$. Then Theorem 4 follows from:

Lemma 21. $\pi(\overline{G}) = 1$.

Proof. We first prove that $(\pi \times \pi)(G) = 1$. Indeed, since $\nu(C) > 0$ by Theorem 19, it follows from Lemma 35 that, from any $(x, y) \in \mathcal{X} \times \mathcal{X}$, the joint chain has positive probability of eventually hitting $C \times C$. It then follows by applying Lemma 20 to the joint chain, that the joint chain will return to $C \times C$ with probability 1 from $(\pi \times \pi)$-a.e. $(x, y) \notin C \times C$. Once the joint chain reaches $C \times C$, then conditional on not coupling, the joint chain will update from $R$ which must be absolutely continuous with respect to $\pi \times \pi$, and hence (again by Lemma 20) will return again to $C \times C$ with probability 1. Hence, the joint chain will repeatedly return to $C \times C$ with probability 1, until such time as $X_n = X'_n$. And by the coupling construction, each time the joint chain is in $C \times C$, it has probability $\ge \epsilon$ of then forcing $X_n = X'_n$. Hence, eventually we will have $X_n = X'_n$, thus proving that $(\pi \times \pi)(G) = 1$.

Now, if we had $\pi(\overline{G}) < 1$, then we would have

\[ (\pi \times \pi)(G^c) = \int_{\mathcal{X}} \pi(dx)\, \pi(G_x^c) = \int_{\overline{G}^c} \pi(dx)\, [1 - \pi(G_x)] > 0, \]

contradicting the fact that $(\pi \times \pi)(G) = 1$. □



5. Central Limit Theorems for Markov Chains 

Suppose $\{X_n\}$ is a Markov chain on a state space $\mathcal{X}$ which is $\phi$-irreducible and aperiodic, and has a stationary distribution $\pi(\cdot)$. Assume the chain begins in stationarity, i.e. that $X_0 \sim \pi(\cdot)$. Let $h : \mathcal{X} \to \mathbf{R}$ be some functional with finite stationary mean $\pi(h) \equiv \int_{x \in \mathcal{X}} h(x)\, \pi(dx)$.

We say that $h$ satisfies a Central Limit Theorem (CLT) (or, $\sqrt{n}$-CLT) if there is some $\sigma^2 < \infty$ such that the normalised sum $n^{-1/2} \sum_{i=1}^n [h(X_i) - \pi(h)]$ converges weakly to a $N(0, \sigma^2)$ distribution. (We allow for the special case $\sigma^2 = 0$, corresponding to convergence to the constant 0.) It then follows (see e.g. Chan and Geyer [15]) that

\[ \sigma^2 = \lim_{n \to \infty} \frac{1}{n}\, \mathbf{E}\Big[\Big(\sum_{i=1}^n [h(X_i) - \pi(h)]\Big)^2\Big], \tag{19} \]

and also $\sigma^2 = \tau\, \mathrm{Var}_\pi(h)$, where $\tau = \sum_{k \in \mathbf{Z}} \mathrm{Corr}(X_0, X_k)$ is the integrated autocorrelation time. (In the reversible case this is also related to spectral measures; see e.g. [47], [34], [69].) Clearly $\sigma^2 < \infty$ requires that $\mathrm{Var}_\pi(h) < \infty$, i.e. that $\pi(h^2) < \infty$.

Such CLTs are helpful for understanding the errors which arise from Monte 
Carlo estimation, and are thus the subject of considerable discussion in the 
MCMC literature (e.g. pH], [S3], [TS], [53, [75], [3H], [B], g3J). 



5.1. A Negative Result

One might expect that CLTs always hold when $\pi(h^2)$ is finite, but this is false. For example, it is shown in [66] that Metropolis-Hastings algorithms whose acceptance probabilities are too low may get so "stuck" that $\tau = \infty$ and they will not have a $\sqrt{n}$-CLT. More specifically, the following is proved:

Theorem 22. Consider a reversible Markov chain, beginning in its stationary distribution $\pi(\cdot)$, and let $r(x) = \mathbf{P}[X_{n+1} = X_n \mid X_n = x]$. Then if

\[ \lim_{n \to \infty} n\, \pi\big([h - \pi(h)]^2\, r^n\big) = \infty, \tag{20} \]

then a $\sqrt{n}$-CLT does not hold for $h$.
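Proof. We compute directly from (19) that

\[
\begin{aligned}
\sigma^2 &= \lim_{n \to \infty} \frac{1}{n}\, \mathbf{E}\Big[\Big(\sum_{i=1}^n [h(X_i) - \pi(h)]\Big)^2\Big] \\
&\ge \lim_{n \to \infty} \frac{1}{n}\, \mathbf{E}\Big[\Big(\sum_{i=1}^n [h(X_i) - \pi(h)]\Big)^2\, \mathbf{1}(X_0 = X_1 = \cdots = X_n)\Big] \\
&= \lim_{n \to \infty} \frac{1}{n}\, \mathbf{E}\Big[\big(n[h(X_0) - \pi(h)]\big)^2\, r(X_0)^n\Big] \\
&= \lim_{n \to \infty} n\, \pi\big([h - \pi(h)]^2\, r^n\big) = \infty,
\end{aligned}
\]

by (20). Hence, a $\sqrt{n}$-CLT cannot exist. □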



In particular, Theorem 22 is used in [66] to prove that for the independence sampler with target $\mathrm{Exp}(1)$ and i.i.d. proposals $\mathrm{Exp}(\lambda)$, the identity function has no $\sqrt{n}$-CLT for any $\lambda > 2$.
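To make the rejection probability $r(x)$ of Theorem 22 concrete for this example, the sketch below evaluates it for the independence sampler with target density $e^{-x}$ and proposal density $\lambda e^{-\lambda y}$. The closed form is our own short derivation for $\lambda > 1$ (it is not quoted from [66]): a proposal $y \ge x$ is always accepted, while $y < x$ is accepted with probability $e^{(\lambda - 1)(y - x)}$. The function names are illustrative, and the numerical integral serves only as a cross-check.

```python
import math


def reject_prob(x, lam):
    """Closed form for r(x) = P[X_{n+1} = X_n | X_n = x], assuming lam > 1."""
    return (1.0
            - lam * math.exp(-(lam - 1.0) * x)
            + (lam - 1.0) * math.exp(-lam * x))


def reject_prob_numeric(x, lam, m=20000):
    """Midpoint-rule check: integrate q(y) * (1 - accept(x, y)) over 0 < y < x."""
    step = x / m
    total = 0.0
    for k in range(m):
        y = (k + 0.5) * step
        # Metropolis-Hastings ratio pi(y) q(x) / (pi(x) q(y)) for this pair.
        accept = min(1.0, math.exp((lam - 1.0) * (y - x)))
        total += lam * math.exp(-lam * y) * (1.0 - accept)
    return total * step
```

As $x$ grows, $r(x)$ tends to 1, so the chain sticks for long stretches in the tail of the target, which is the mechanism behind condition (20).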

The question then arises of what conditions on the Markov chain transitions, and on the functional $h$, guarantee a $\sqrt{n}$-CLT for $h$.

5.2. Conditions Guaranteeing CLTs 

Here we present various positive results about the existence of CLTs. Some, 
though not all, of these results are then proved in the following two sections. 

For i.i.d. samples, classical theory guarantees a CLT provided the second moments are finite (e.g. [13], Theorem 27.1; [83], p. 110). For uniformly ergodic chains, an identical result exists; it is shown in Corollary 4.2(ii) of Cogburn [17] (cf. Theorem 5 of Tierney [93]) that:

Theorem 23. If a Markov chain with stationary distribution $\pi(\cdot)$ is uniformly ergodic, then a $\sqrt{n}$-CLT holds for $h$ whenever $\pi(h^2) < \infty$.

If a chain is just geometrically ergodic but not uniformly ergodic, then a similar result holds under the slightly stronger assumption of a finite $2 + \delta$ moment. That is, it is shown in Theorem 18.5.3 of Ibragimov and Linnik [40] (see also Theorem 2 of Chan and Geyer [15], and Theorem 2 of Hobert et al. [38]) that:

Theorem 24. If a Markov chain with stationary distribution $\pi(\cdot)$ is geometrically ergodic, then a $\sqrt{n}$-CLT holds for $h$ whenever $\pi(|h|^{2+\delta}) < \infty$ for some $\delta > 0$.

It follows, for example, that the independence sampler example mentioned above (which fails to have a $\sqrt{n}$-CLT, but which has finite moments of all orders) is not geometrically ergodic.

It is shown in Corollary 3 of [69] that Theorem 24 can be strengthened if the chain is reversible:

Theorem 25. If the Markov chain is geometrically ergodic and reversible, then a $\sqrt{n}$-CLT holds for $h$ whenever $\pi(h^2) < \infty$.

Comparing Theorems 24 and 25 leads to the following yes-or-no question (see [B]): if a Markov chain is geometrically ergodic, but not necessarily reversible, and $\pi(h^2) < \infty$, then does a $\sqrt{n}$-CLT necessarily exist for $h$? In the first draft of this paper, we posed that question as an Open Problem. However, it was recently solved by Häggström [35], who produced a counter-example to prove the following:

Theorem 26. There exists a (non-reversible) geometrically ergodic Markov chain, on a (countable) state space $\mathcal{X}$, and a function $h : \mathcal{X} \to \mathbf{R}$, such that $\pi(h^2) < \infty$, but such that $h$ does not satisfy a $\sqrt{n}$-CLT (nor a CLT with any other scaling).

If $P$ is reversible, then it was proved by Kipnis and Varadhan [47] that finiteness of $\sigma^2$ is all that is required:

Theorem 27. For a $\phi$-irreducible and aperiodic Markov chain which is reversible, a $\sqrt{n}$-CLT holds for $h$ whenever $\sigma^2 < \infty$, where $\sigma^2$ is given by (19).

In a different direction, we have the following:

Theorem 28. Suppose a Markov chain is geometrically ergodic, satisfying (10) for some $V : \mathcal{X} \to [1, \infty]$ which is finite $\pi$-a.e. Let $h : \mathcal{X} \to \mathbf{R}$ with $h^2 \le KV$ for some $K < \infty$. Then a $\sqrt{n}$-CLT holds for $h$.

Before proving some of these results, we consider two extensions which are 
straightforward mathematically, but which may be of practical importance. 

Proposition 29. The above CLT results (i.e., Theorems 23, 24, 25, 27, and 28) all remain true if instead of beginning with $X_0 \sim \pi(\cdot)$, as above, we begin with $X_0 = x$, for $\pi$-a.e. $x \in \mathcal{X}$.

Proof. The hypotheses of the various CLT results all imply that the chain is $\phi$-irreducible and aperiodic, with stationary distribution $\pi(\cdot)$. Hence, by Theorem 4, there is convergence to $\pi(\cdot)$ from $\pi$-a.e. $x \in \mathcal{X}$. For such $x$, let $\epsilon > 0$, and find $m \in \mathbf{N}$ such that $\|P^m(x, \cdot) - \pi(\cdot)\| < \epsilon$. It then follows from Proposition 3(g) that we can jointly construct copies $\{X_n\}$ and $\{X'_n\}$ of the Markov chain, with $X_0 = x$ and $X'_0 \sim \pi(\cdot)$, such that

\[ \mathbf{P}[X_n = X'_n \text{ for all } n \ge m] > 1 - \epsilon. \]

But this means that for any $A \subseteq \mathbf{R}$,

\[ \limsup_{n \to \infty} \Big| \mathbf{P}\Big(n^{-1/2} \sum_{i=1}^n [h(X_i) - \pi(h)] \in A\Big) - \mathbf{P}\Big(n^{-1/2} \sum_{i=1}^n [h(X'_i) - \pi(h)] \in A\Big) \Big| \le \epsilon. \]

Since $\epsilon > 0$ is arbitrary, and since we know that $n^{-1/2} \sum_{i=1}^n [h(X'_i) - \pi(h)]$ converges in distribution to $N(0, \sigma^2)$, hence so does $n^{-1/2} \sum_{i=1}^n [h(X_i) - \pi(h)]$. □

Proposition 30. The CLT Theorems 23 and 24 remain true if the chain is periodic of period $d \ge 2$, provided that the $d$-step chain $P' = P^d|_{\mathcal{X}_1}$ (as in the proof of Corollary 6) has all the other properties required of $P$ in the original result (i.e. $\phi$-irreducibility, and uniform or geometric ergodicity), and that the function $h$ still satisfies the same moment condition.

Proof. As in the proof of Corollary 5, let $\overline{P}$ be the $d$-step chain defined on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, and $\overline{h}(x_0, \ldots, x_{d-1}) = h(x_0) + \cdots + h(x_{d-1})$. Then $\overline{P}$ inherits the irreducibility and ergodicity properties of $P'$ (formally, since $P'$ is de-initialising for $\overline{P}$; see [73]). Then, Theorem 23 or 24 establishes a CLT for $\overline{P}$ and $\overline{h}$. However, this is easily seen to be equivalent to the corresponding CLT for the original $P$ and $h$, thus giving the result. □

Remark. In particular, combining Theorem 23 with Proposition 30, we see that a $\sqrt{n}$-CLT holds for any function $h$ for any irreducible (or indecomposable) Markov chain on a finite state space, without any assumption of aperiodicity. (See also the Remark following Corollary 5 above.)

Remark. We note that for periodic chains as in Proposition 30, the formula (19) for the asymptotic variance $\sigma^2$ continues to hold without change. The relation $\sigma^2 = \tau\, \mathrm{Var}_\pi(h)$ also continues to hold, except that now the formula for the integrated autocorrelation time $\tau$ requires that the sum be taken over ranges whose lengths are multiples of $d$, i.e. the flexibly-ordered infinite sum $\tau = \sum_{k \in \mathbf{Z}} \mathrm{Corr}(X_0, X_k)$ must be replaced by the more precisely limited sum $\tau = \lim_{m \to \infty} \sum_{k=-md}^{md-1} \mathrm{Corr}(X_0, X_k)$ (otherwise the sum will not converge, since now the individual terms do not go to 0).



5.3. CLT Proofs using the Poisson Equation 

Here we provide proofs of some of the results stated in the previous subsection. 

We begin by stating a version of the martingale central limit theorem, which was proved independently by Billingsley [12] and Ibragimov [39]; see e.g. p. 375 of Durrett [25].

Theorem 31. (Billingsley [12] and Ibragimov [39]) Let $\{Z_n\}$ be a stationary ergodic sequence, with $\mathbf{E}[Z_n \mid Z_1, \ldots, Z_{n-1}] = 0$ and $\mathbf{E}[(Z_n)^2] < \infty$. Then $n^{-1/2} \sum_{i=1}^n Z_i$ converges weakly to a $N(0, \sigma^2)$ distribution for some $\sigma^2 < \infty$.

To make use of Theorem 31, consider the Poisson equation: $h - \pi(h) = g - Pg$. A useful result is the following (see e.g. Theorem 17.4.4 of Meyn and Tweedie [54]):

Theorem 32. Let $P$ be a transition kernel for an aperiodic, $\phi$-irreducible Markov chain on a state space $\mathcal{X}$, having stationary distribution $\pi(\cdot)$, with $X_0 \sim \pi(\cdot)$. Let $h : \mathcal{X} \to \mathbf{R}$ with $\pi(h^2) < \infty$, and suppose there exists $g : \mathcal{X} \to \mathbf{R}$ with $\pi(g^2) < \infty$ which solves the Poisson equation, i.e. such that $h - \pi(h) = g - Pg$. Then $h$ satisfies a $\sqrt{n}$-CLT.

Proof. Let $Z_n = g(X_n) - Pg(X_{n-1})$. Then $\{Z_n\}$ is stationary since $X_0 \sim \pi(\cdot)$. Also $\{Z_n\}$ is ergodic since the Markov chain converges asymptotically (by Theorem 4). Furthermore, $\mathbf{E}[Z_n^2] \le 4\pi(g^2) < \infty$. Also,

\[
\begin{aligned}
\mathbf{E}[g(X_n) - Pg(X_{n-1}) \mid X_0, \ldots, X_{n-1}] &= \mathbf{E}[g(X_n) \mid X_{n-1}] - Pg(X_{n-1}) \\
&= Pg(X_{n-1}) - Pg(X_{n-1}) = 0.
\end{aligned}
\]

Since $Z_1, \ldots, Z_{n-1} \in \sigma(X_0, \ldots, X_{n-1})$, it follows that $\mathbf{E}_\pi[Z_n \mid Z_1, \ldots, Z_{n-1}] = 0$. Hence, by Theorem 31, $n^{-1/2} \sum_{i=1}^n Z_i$ converges weakly to $N(0, \sigma^2)$. But

\[
\begin{aligned}
n^{-1/2} \sum_{i=1}^n [h(X_i) - \pi(h)] &= n^{-1/2} \sum_{i=1}^n [g(X_i) - Pg(X_i)] \\
&= n^{-1/2} \sum_{i=1}^n [g(X_i) - Pg(X_{i-1})] + n^{-1/2} Pg(X_0) - n^{-1/2} Pg(X_n) \\
&= n^{-1/2} \sum_{i=1}^n Z_i + n^{-1/2} Pg(X_0) - n^{-1/2} Pg(X_n).
\end{aligned}
\]

The result follows since $n^{-1/2} Pg(X_0)$ and $n^{-1/2} Pg(X_n)$ both converge to zero in probability as $n \to \infty$. □

Corollary 33. If $\sum_{k=0}^\infty \sqrt{\pi\big((P^k[h - \pi(h)])^2\big)} < \infty$, then $h$ satisfies a $\sqrt{n}$-CLT.

Proof. Let

\[ g_k(x) = P^k h(x) - \pi(h) = P^k[h - \pi(h)](x), \]

where by convention $P^0 h(x) = h(x)$, and let $g(x) = \sum_{k=0}^\infty g_k(x)$. Then we compute directly that

\[
\begin{aligned}
(g - Pg)(x) &= \sum_{k=0}^\infty g_k(x) - \sum_{k=0}^\infty P g_k(x) = \sum_{k=0}^\infty g_k(x) - \sum_{k=1}^\infty g_k(x) \\
&= g_0(x) = P^0 h(x) - \pi(h) = h(x) - \pi(h).
\end{aligned}
\]
Hence, the result follows from Theorem 32, provided that $\pi(g^2) < \infty$. On the other hand, it is known (in fact, since $\mathrm{Cov}(X, Y) \le \sqrt{\mathrm{Var}(X)\, \mathrm{Var}(Y)}$) that the $L^2(\pi)$ norm satisfies the triangle inequality, so that

\[ \sqrt{\pi(g^2)} \le \sum_{k=0}^\infty \sqrt{\pi(g_k^2)}, \]

so that $\pi(g^2) < \infty$ provided $\sum_{k=0}^\infty \sqrt{\pi(g_k^2)} < \infty$. □
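On a finite state space the series $g = \sum_k P^k[h - \pi(h)]$ from this proof can be summed directly. The sketch below is illustrative only: the helper names, the two-state transition matrix and the truncation lengths are our own assumptions. It builds $g$ by truncating the series and then evaluates $\pi(g^2) - \pi((Pg)^2)$, the variance of one martingale increment $g(X_1) - Pg(X_0)$ in stationarity, which is the usual value of $\sigma^2$ delivered by the martingale CLT.

```python
def mat_vec(P, v):
    """Apply the kernel to a function: (Pv)(i) = sum_j P[i][j] * v[j]."""
    return [sum(P[i][j] * v[j] for j in range(len(v))) for i in range(len(P))]


def stationary(P, iters=2000):
    """Approximate the stationary distribution by repeated pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def poisson_solution(P, h, terms=2000):
    """Return (pi, pi(h), g) with g the truncated series sum_k P^k [h - pi(h)]."""
    pi = stationary(P)
    mean = sum(p * x for p, x in zip(pi, h))
    gk = [x - mean for x in h]  # g_0 = h - pi(h)
    g = [0.0] * len(h)
    for _ in range(terms):
        g = [a + c for a, c in zip(g, gk)]
        gk = mat_vec(P, gk)  # g_{k+1} = P g_k
    return pi, mean, g


def asymptotic_variance(P, h):
    """pi(g^2) - pi((Pg)^2): variance of the increment g(X_1) - Pg(X_0)."""
    pi, _, g = poisson_solution(P, h)
    Pg = mat_vec(P, g)
    return sum(p * (a * a - c * c) for p, a, c in zip(pi, g, Pg))


# Hypothetical two-state example.
P_example = [[0.9, 0.1], [0.3, 0.7]]
h_example = [0.0, 1.0]
sigma2_example = asymptotic_variance(P_example, h_example)
```

For this two-state chain the second eigenvalue is $\rho = 0.6$, and the computed variance agrees with the closed form $\mathrm{Var}_\pi(h)(1 + \rho)/(1 - \rho)$.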
Proof of Theorem 25. Let

\[ \|P\|_{L^2(\pi)} = \sup_{\substack{\pi(f) = 0 \\ \pi(f^2) = 1}} \pi\big((Pf)^2\big) = \sup_{\substack{\pi(f) = 0 \\ \pi(f^2) = 1}} \int_{x \in \mathcal{X}} \Big(\int_{y \in \mathcal{X}} f(y)\, P(x, dy)\Big)^2 \pi(dx) \]

be the usual $L^2(\pi)$ operator norm for $P$, when restricted to those functionals $f$ with $\pi(f) = 0$ and $\pi(f^2) < \infty$. Then it is shown in Theorem 2 of [69] that reversible chains are geometrically ergodic if and only if they satisfy $\|P\|_{L^2(\pi)} < 1$, i.e. there is $\beta < 1$ with $\pi((Pf)^2) \le \beta^2 \pi(f^2)$ whenever $\pi(f) = 0$ and $\pi(f^2) < \infty$. Furthermore, reversibility implies self-adjointness of $P$ in $L^2(\pi)$, so that $\|P^k\|_{L^2(\pi)} = \|P\|^k_{L^2(\pi)}$, and hence $\pi((P^k f)^2) \le \beta^{2k} \pi(f^2)$.

Let $g_k = P^k h - \pi(h)$ as in the proof of Corollary 33. Then this implies that $\pi((g_k)^2) \le \beta^{2k}\, \pi((h - \pi(h))^2)$, so that

\[ \sum_{k=0}^\infty \sqrt{\pi(g_k^2)} \le \sqrt{\pi\big((h - \pi(h))^2\big)} \sum_{k=0}^\infty \beta^k = \sqrt{\pi\big((h - \pi(h))^2\big)} \Big/ (1 - \beta) < \infty. \]

Hence, the result follows from Corollary 33. □

Proof of Theorem 28. By Fact 10, there is $C < \infty$ and $\rho < 1$ with $|P^n f(x) - \pi(f)| \le C\, V(x)\, \rho^n$ for $x \in \mathcal{X}$ and $|f| \le V$, and furthermore $\pi(V) < \infty$. Let $g_k = P^k[h - \pi(h)]$ as in the proof of Corollary 33. Then by the Cauchy-Schwartz inequality, $(g_k)^2 = (P^k[h - \pi(h)])^2 \le P^k([h - \pi(h)]^2)$. On the other hand, since $[h - \pi(h)]^2 \le KV$, so $[h - \pi(h)]^2 / K \le V$, we have $(g_k)^2 \le P^k([h - \pi(h)]^2) \le CKV\rho^k$. This implies that $\pi((g_k)^2) \le CK\rho^k \pi(V)$, so that

\[ \sum_{k=0}^\infty \sqrt{\pi(g_k^2)} \le \sqrt{CK\, \pi\big((h - \pi(h))^2\big)} \sum_{k=0}^\infty \rho^{k/2} = \sqrt{CK\, \pi\big((h - \pi(h))^2\big)} \Big/ (1 - \sqrt{\rho}) < \infty. \]

Hence, the result again follows from Corollary 33. □
5.4. Proof of Theorem 24 using Regenerations

Here we use regeneration theory to give a reasonably direct proof of Theorem 24, following the outline of Hobert et al. [38], thereby avoiding the technicalities of the original proof of Ibragimov and Linnik [40].

We begin by noting from Fact 10 that since the chain is geometrically ergodic, there is a small set $C$ and a drift function $V$ satisfying (8) and (10).

In terms of this, we consider a regeneration construction for the chain (cf. 
01 [I], [57]) [SB])- This is very similar to the coupling construction presented 
in Section 4, except now just for a single chain $\{X_n\}$. Thus, in the coupling construction we omit option 1, and merely update the single chain. More formally, given $X_n$, we proceed as follows. If $X_n \notin C$, then we simply choose $X_{n+1} \sim P(X_n, \cdot)$. Otherwise, if $X_n \in C$, then with probability $\epsilon$ we choose $X_{n+n_0} \sim \nu(\cdot)$, while with probability $1 - \epsilon$ we choose $X_{n+n_0} \sim R(X_n, \cdot)$. [If $n_0 > 1$, we then fill in the missing values $X_{n+1}, \ldots, X_{n+n_0-1}$ as usual.]

We let $T_1, T_2, \ldots$ be the regeneration times, i.e. the times such that $X_{T_i} \sim \nu(\cdot)$ as above. Thus, the regeneration times occur with probability $\epsilon$ precisely $n_0$ iterations after each time the chain enters $C$ (not counting those entries of $C$ which are within $n_0$ of a previous regeneration attempt).

The benefit of regeneration times is that they break up sums like $\sum_{i=1}^n [h(X_i) - \pi(h)]$ into sums over tours, each of the form $\sum_{i=T_j}^{T_{j+1}-1} [h(X_i) - \pi(h)]$. Furthermore, since each subsequent tour begins from the same fixed distribution $\nu(\cdot)$, we see that the different tours, after the first one, are independent and identically distributed (i.i.d.).

More specifically, let $T_0 = 0$, and let $r(n) = \sup\{i \ge 0;\ T_i \le n\}$. Then

\[ \sum_{i=1}^n [h(X_i) - \pi(h)] = \sum_{j=1}^{r(n)} \sum_{i=T_j}^{T_{j+1}-1} [h(X_i) - \pi(h)] + E(n), \tag{21} \]

where $E(n)$ is an error term which collects the terms corresponding to the incomplete final tour $X_{T_{r(n)+1}}, \ldots, X_n$, and also the first tour $X_0, \ldots, X_{T_1 - 1}$.

Now, the tours $\{\{X_{T_i}, X_{T_i+1}, \ldots, X_{T_{i+1}-1}\},\ i = 1, 2, \ldots\}$ are independent and identically distributed. Moreover, elementary renewal theory (see for example [4]) ensures that $r(n)/n \to \epsilon\pi(C)$ in probability. Hence, the classical central limit theorem (see e.g. [13], Theorem 27.1; or [83], p. 110) will prove Theorem 24, provided that each term has finite second moment, and that the error term $E(n)$ can be neglected.

To continue, we note that geometric ergodicity implies (as in the proof of Lemma 18) exponential tails on the return times to $C$. It then follows (cf. Theorem 2.5 of [94]) that there is $\beta > 1$ with

\[ \mathbf{E}_\pi[\beta^{T_1}] < \infty, \quad \text{and} \quad \mathbf{E}[\beta^{T_{j+1} - T_j}] < \infty. \tag{22} \]

(This also follows from Theorem 15.0.1 of [54], together with a simple argument using probability generating functions.)

Now, it seems intuitively clear that $E(n)$ is $O_p(1)$ as $n \to \infty$, so when multiplied by $n^{-1/2}$, it will not contribute to the limit. Formally, this follows from (22), which implies by standard renewal theory that $E(n)$ has a limiting distribution as $n \to \infty$, which in turn implies that $E(n)$ is $O_p(1)$ as $n \to \infty$. Thus, the term $E(n)$ can be neglected without affecting the result.

Hence, it remains only to prove the finite second moments of each term in (21). Recalling that each tour begins in the distribution $\nu(\cdot)$, we see that the proof of Theorem 24 is completed by the following lemma:

Lemma 34. $\int_{x \in \mathcal{X}} \nu(dx)\, \mathbf{E}\Big[\Big(\sum_{i=0}^{T_1 - 1} [h(X_i) - \pi(h)]\Big)^2 \,\Big|\, X_0 = x\Big] < \infty$.

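Proof. Note that

\[ \pi(\cdot) = \int_{x \in \mathcal{X}} \pi(dx)\, P^{n_0}(x, \cdot) \ge \int_{x \in C} \pi(dx)\, P^{n_0}(x, \cdot) \ge \pi(C)\, \epsilon\, \nu(\cdot), \]

so that $\nu(dx) \le \pi(dx) / \pi(C)\epsilon$. Hence, it suffices to prove the lemma with $\nu(dx)$ replaced by $\pi(dx)$, i.e. under the assumption that $X_0 \sim \pi(\cdot)$.

For notational simplicity, set $H_i = h(X_i) - \pi(h)$, and $\mathbf{E}_\pi[\cdots] = \int_{x \in \mathcal{X}} \mathbf{E}[\cdots \mid X_0 = x]\, \pi(dx)$. Note that $\big(\sum_{i=0}^{T_1 - 1} [h(X_i) - \pi(h)]\big)^2 = \big(\sum_{i=0}^\infty \mathbf{1}_{i < T_1} H_i\big)^2$. Hence, by Cauchy-Schwartz,

\[ \mathbf{E}_\pi\Big[\Big(\sum_{i=0}^{T_1 - 1} [h(X_i) - \pi(h)]\Big)^2\Big] \le \Big(\sum_{i=0}^\infty \sqrt{\mathbf{E}_\pi[\mathbf{1}_{i < T_1} H_i^2]}\Big)^2. \tag{23} \]

To continue, let $p = 1 + 2/\delta$ and $q = 1 + \delta/2$, so that $1/p + 1/q = 1$. Then by Hölder's inequality (e.g. [13], p. 80),

\[ \mathbf{E}_\pi[\mathbf{1}_{i < T_1} H_i^2] \le \mathbf{E}_\pi[\mathbf{1}_{i < T_1}]^{1/p}\, \mathbf{E}_\pi[|H_i|^{2q}]^{1/q}. \tag{24} \]

Now, since $X_0 \sim \pi(\cdot)$, therefore $\mathbf{E}_\pi[|H_i|^{2q}] \equiv K$ is a constant, independent of $i$, which is finite since $\pi(|h|^{2+\delta}) < \infty$.

Also, using (22), Markov's inequality then gives that $\mathbf{E}_\pi[\mathbf{1}_{0 \le i < T_1}] \le \mathbf{E}_\pi[\mathbf{1}_{\beta^{T_1} > \beta^i}] \le \beta^{-i}\, \mathbf{E}_\pi[\beta^{T_1}]$. Hence, combining (23) and (24), we obtain that

\[
\begin{aligned}
\mathbf{E}_\pi\Big[\Big(\sum_{i=0}^{T_1 - 1} [h(X_i) - \pi(h)]\Big)^2\Big]
&\le \Big(\sum_{i=0}^\infty \sqrt{\mathbf{E}_\pi[\mathbf{1}_{i < T_1}]^{1/p}\, \mathbf{E}_\pi[|H_i|^{2q}]^{1/q}}\Big)^2 \\
&\le \Big(\sum_{i=0}^\infty \sqrt{\big(\beta^{-i}\, \mathbf{E}_\pi[\beta^{T_1}]\big)^{1/p} K^{1/q}}\Big)^2
 = \Big(K^{1/2q}\, \mathbf{E}_\pi[\beta^{T_1}]^{1/2p} \sum_{i=0}^\infty \beta^{-i/2p}\Big)^2 \\
&= \Big(K^{1/2q}\, \mathbf{E}_\pi[\beta^{T_1}]^{1/2p} \Big/ \big(1 - \beta^{-1/2p}\big)\Big)^2 < \infty. \qquad \Box
\end{aligned}
\]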
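As a quick check on the exponents used in this proof (a worked verification of our own, not part of the original argument), the choice $p = 1 + 2/\delta$, $q = 1 + \delta/2$ is exactly the conjugate pair for which the moment appearing in (24) is the assumed $2 + \delta$ moment:

```latex
\[
\frac{1}{p} + \frac{1}{q}
  = \frac{\delta}{\delta + 2} + \frac{2}{2 + \delta}
  = 1,
\qquad
2q = 2 + \delta ,
\]
\[
\text{so that}\quad
\mathbf{E}_\pi\big[|H_i|^{2q}\big]
  = \pi\big(|h - \pi(h)|^{2+\delta}\big) < \infty
\quad\text{whenever}\quad \pi\big(|h|^{2+\delta}\big) < \infty .
\]
```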

It appears at first glance that Theorem 23 could be proved by similar regeneration arguments. However, we have been unable to do so.

Open Problem # 2. Can Theorem 23 be proved by direct regeneration arguments, similar to the above proof of Theorem 24?

6. Optimal Scaling and Weak Convergence 

Finally, we briefly discuss another application of probability theory to MCMC, namely the optimal scaling problem. Our presentation here is quite brief; for further details see the review article [74].

Let $\pi_u : \mathbf{R}^d \to [0, \infty)$ be a continuous $d$-dimensional density ($d$ large). Consider running a Metropolis-Hastings algorithm for $\pi_u$. The optimal scaling problem concerns the question of how we should choose the proposal distribution for this algorithm.

For concreteness, consider either the random-walk Metropolis (RWM) algorithm with proposal distribution given by $Q(x, \cdot) = N(x, \sigma^2 I_d)$, or the Langevin algorithm with proposal distribution given by $Q(x, \cdot) = N(x + \frac{\sigma^2}{2} \nabla \log \pi_u(x),\ \sigma^2 I_d)$. In either case, the question becomes, how should we choose $\sigma^2$?

If $\sigma^2$ is chosen to be too small, then by continuity the resulting Markov chain will nearly always accept its proposed value. However, the proposed value will usually be extremely close to the chain's previous state, so that the chain will move extremely slowly, leading to a very high acceptance rate, but very poor performance. On the other hand, if $\sigma^2$ is chosen to be too large, then the proposed values will usually be very far from the current state. Unless the chain gets very "lucky", then those proposed values will usually be rejected, so that the chain will tend to get "stuck" at the same state for large periods of time. This will lead to a very low acceptance rate, and again a very poorly performing algorithm. We conclude that proposal scalings satisfy a Goldilocks Principle: The choice of the proposal scaling $\sigma^2$ should be "just right", neither too small nor too large.
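The two failure modes described above are easy to observe in a small experiment. The following is a minimal sketch under assumptions of our own (a one-dimensional standard normal target, a chain started at 0 rather than in stationarity, and a hypothetical function name); it simply counts accepted proposals for a given proposal standard deviation `sigma`.

```python
import math
import random


def rwm_acceptance_rate(sigma, n_steps=5000, seed=0):
    """Run random-walk Metropolis on a N(0, 1) target; return the accepted fraction."""
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    for _ in range(n_steps):
        y = x + sigma * rng.gauss(0.0, 1.0)   # proposal from N(x, sigma^2)
        log_ratio = 0.5 * (x * x - y * y)     # log pi(y) - log pi(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
            accepted += 1
    return accepted / n_steps


rate_small = rwm_acceptance_rate(0.01)   # tiny steps: almost always accepted
rate_large = rwm_acceptance_rate(100.0)  # huge steps: almost always rejected
```

A high acceptance rate from tiny steps and a low one from huge steps both signal poor mixing; the acceptance rate alone is informative only once the scaling is in a sensible range.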

To prove theorems about this, assume for now that

\[ \pi_u(x) = \prod_{i=1}^d f(x_i), \tag{25} \]

i.e. that the density $\pi_u$ factors into i.i.d. components, each with (smooth) density $f$. (This assumption is obviously very restrictive, and is uninteresting in practice since then each coordinate can be simulated separately. However, it does allow us to develop some interesting theory, which may approximately apply in other cases as well.) Also, assume that the chain begins in stationarity, i.e. that $X_0 \sim \pi(\cdot)$.

6.1. The Random Walk Metropolis (RWM) Case 

For RWM, let $I = \mathbf{E}[((\log f(Z))')^2]$ where $Z \sim f(z)\,dz$. Then it turns out, essentially, that under the assumption (25), as $d \to \infty$ it is optimal to choose $\sigma^2 \doteq (2.38)^2 / Id$, leading to an asymptotic acceptance rate $\doteq 0.234$.

More precisely, set the proposal variance to be $\sigma_d^2 = \ell^2 / d$, where $\ell > 0$ is to be chosen later. Let $\{X_n\}$ be the Random Walk Metropolis algorithm for $\pi(\cdot)$ on $\mathbf{R}^d$ with proposal variance $\sigma_d^2$. Also, let $\{N(t)\}_{t \ge 0}$ be a Poisson process with rate $d$ which is independent of $\{X_n\}$. Finally, let

\[ Z_t^d = X_{N(t),1}, \qquad t \ge 0. \]

Thus, $\{Z_t^d\}_{t \ge 0}$ follows the first component of $\{X_n\}$, with time speeded up by a factor of $d$.

Then it is proved in [67] (see also [74]), using the theory from Ethier and Kurtz [26], that as $d \to \infty$, the process $\{Z_t^d\}_{t \ge 0}$ converges weakly to a diffusion process $\{Z_t\}_{t \ge 0}$ which satisfies the following stochastic differential equation:

\[ dZ_t = h(\ell)^{1/2}\, dB_t + \tfrac{1}{2}\, h(\ell)\, \nabla \log \pi_u(Z_t)\, dt. \]

Here

\[ h(\ell) = 2\ell^2\, \Phi\big(-\sqrt{I}\,\ell/2\big) \]

corresponds to the speed of the limiting diffusion, where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-s^2/2}\, ds$ is the cdf of a standard normal distribution.

We then compute numerically that the choice $\ell = \hat{\ell} \doteq 2.38/\sqrt{I}$ maximises the above speed function $h(\ell)$, and thus must be the choice leading to optimally fast mixing (at least, as $d \to \infty$). Furthermore, it is also proved in [67] that the asymptotic (i.e., expected value with respect to the stationary distribution) acceptance rate of the algorithm is given by the formula $A(\ell) = 2\,\Phi\big(-\sqrt{I}\,\ell/2\big)$, and we compute that $A(\hat{\ell}) \doteq 0.234$, thus giving the optimal asymptotic acceptance rate.
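The numerical maximisation mentioned above is short enough to reproduce. The sketch below is one simple way to do it (a plain grid search; the function names, the grid resolution and the parameter name `info`, which stands for $I$, are our own choices): it maximises $h(\ell) = 2\ell^2 \Phi(-\sqrt{I}\,\ell/2)$ and evaluates $A(\ell) = 2\Phi(-\sqrt{I}\,\ell/2)$ at the maximiser.

```python
import math


def Phi(x):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rwm_speed(ell, info=1.0):
    """Speed h(ell) = 2 ell^2 Phi(-sqrt(I) ell / 2) of the limiting diffusion."""
    return 2.0 * ell * ell * Phi(-math.sqrt(info) * ell / 2.0)


def rwm_acceptance(ell, info=1.0):
    """Asymptotic acceptance rate A(ell) = 2 Phi(-sqrt(I) ell / 2)."""
    return 2.0 * Phi(-math.sqrt(info) * ell / 2.0)


def argmax_on_grid(f, lo, hi, n=100000):
    """Return the grid point in [lo, hi] at which f is largest."""
    best_x, best_val = lo, f(lo)
    for k in range(1, n + 1):
        x = lo + (hi - lo) * k / n
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x


ell_hat = argmax_on_grid(rwm_speed, 0.0, 10.0)   # case I = 1
acc_hat = rwm_acceptance(ell_hat)
```

Because $h$ depends on $\ell$ only through $\sqrt{I}\,\ell$ apart from the factor $\ell^2$, changing $I$ rescales the maximiser by $1/\sqrt{I}$ and leaves the optimal acceptance rate unchanged.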

6.2. The Langevin Algorithm Case 

In the Langevin case, let $J = E[(5((\log f(Z))''')^2 - 3((\log f(Z))'')^3)/48]$ where
again $Z \sim f(z)\,dz$. Then it turns out, essentially, that assuming (25), it is
optimal as $d \to \infty$ to choose $\sigma^2 \approx (0.825)^2/(J^{1/2} d^{1/3})$, leading to an asymptotic
acceptance rate $\approx 0.574$.

More precisely, set $\sigma_d^2 = \ell^2/d^{1/3}$, let $\{X_n\}$ be the Langevin Algorithm for
$\pi(\cdot)$ on $\mathbf{R}^d$ with proposal variance $\sigma_d^2$, let $\{N(t)\}_{t \ge 0}$ be a Poisson process with
rate $d^{1/3}$ which is independent of $\{X_n\}$, and let

$$Z^d_t = X^{(1)}_{N(t)},$$

so that $\{Z^d_t\}_{t \ge 0}$ follows the first component of $\{X_n\}$, with time speeded up by
a factor of $d^{1/3}$. Then it is proved in [70] (see also [71]) that as $d \to \infty$, the
process $\{Z^d_t\}_{t \ge 0}$ converges weakly to a diffusion process $\{Z_t\}_{t \ge 0}$ which satisfies
the following stochastic differential equation:

$$dZ_t = g(\ell)^{1/2}\, dB_t + \tfrac{1}{2}\, g(\ell)\, \nabla \log \pi_u(Z_t)\, dt.$$

Here

$$g(\ell) = 2\,\ell^2\, \Phi(-J \ell^3)$$

represents the speed of the limiting diffusion. We then compute numerically
that the choice $\ell = \hat\ell \approx 0.825/\sqrt{J}$ maximises $g(\ell)$, and thus must be the choice
leading to optimally fast mixing (at least, as $d \to \infty$). Furthermore, it is proved
in [70] that the asymptotic acceptance rate satisfies $A(\hat\ell) = 2\Phi(-J \hat\ell^3) \approx 0.574$,
thus giving the optimal asymptotic acceptance rate for the Langevin case.
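
As in the RWM case, the numerical step can be sketched directly. The code below is an illustration under stated assumptions rather than the computation of [70]: it takes the speed function to be $g(\ell) = 2\ell^2\Phi(-J\ell^3)$ with acceptance rate $2\Phi(-J\ell^3)$, fixes $J = 1$ purely for illustration (so that the constant 0.825 can be read off directly), and uses a grid search. The names `g`, `langevin_acceptance` and `argmax_on_grid` are ours.

```python
import math


def Phi(x):
    """Standard normal cdf, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def g(ell, J=1.0):
    """Assumed Langevin speed function g(l) = 2 l^2 Phi(-J l^3)."""
    return 2.0 * ell ** 2 * Phi(-J * ell ** 3)


def langevin_acceptance(ell, J=1.0):
    """Assumed asymptotic acceptance rate 2 Phi(-J l^3)."""
    return 2.0 * Phi(-J * ell ** 3)


def argmax_on_grid(f, lo, hi, steps=100_000):
    """Return the grid point in [lo, hi] at which f is largest."""
    best_x, best_val = lo, f(lo)
    for k in range(1, steps + 1):
        x = lo + (hi - lo) * k / steps
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x


# Illustration with J = 1 only; this value of J is an assumption of the sketch.
ell_hat = argmax_on_grid(g, 0.0, 5.0)
print("maximiser of g (J = 1):", round(ell_hat, 3))
print("acceptance rate at the maximiser:", round(langevin_acceptance(ell_hat), 3))
```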

6.3. Discussion of Optimal Scaling 

The above results show that for either the RWM or the Langevin algorithm,
under the assumption (25), we can determine the optimal proposal scaling just
in terms of universally optimal asymptotic acceptance rates (0.234 for RWM,
0.574 for Langevin). Such results are straightforward to apply in practice, since it
is trivial for a computer to monitor the acceptance rate of the algorithm, and the
user can modify $\sigma^2$ to bring the acceptance rate close to the target. Thus,
these optimal scaling rates are often used in applied contexts (see e.g. Møller et
al. [56]). (It may even be possible for the computer to adaptively modify $\sigma^2$ to
achieve the appropriate acceptance rates; see [5] and references therein. However,
it is important to recognise that adaptive strategies can violate the stationarity
of $\pi$, so they have to be carefully implemented; see for example [35].)
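
To illustrate how the acceptance rate can be monitored, here is a minimal sketch of a Random Walk Metropolis run that simply counts accepted proposals. It is an illustration only: the target is assumed to be a standard normal distribution on $\mathbf{R}^d$ (so that $I = 1$), the chain is started in stationarity, and the function name `rwm_acceptance_rate`, the dimension $d = 30$ and the run length are hypothetical choices of ours. No adaptation is performed, so the stationarity concerns mentioned above do not arise.

```python
import math
import random


def rwm_acceptance_rate(d, sigma, n_iter, seed=0):
    """Run RWM on a N(0, I_d) target; return the fraction of accepted proposals."""
    rng = random.Random(seed)
    # Start in stationarity: X_0 ~ pi.
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    log_pi_x = -0.5 * sum(v * v for v in x)
    accepted = 0
    for _ in range(n_iter):
        # Proposal Y = X + sigma * N(0, I_d).
        y = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
        log_pi_y = -0.5 * sum(v * v for v in y)
        # Accept with probability min(1, pi(Y) / pi(X)).
        if rng.random() < math.exp(min(0.0, log_pi_y - log_pi_x)):
            x, log_pi_x = y, log_pi_y
            accepted += 1
    return accepted / n_iter


d = 30
sigma = 2.38 / math.sqrt(d)  # sigma^2 = (2.38)^2 / (I d) with I = 1
rate = rwm_acceptance_rate(d, sigma, n_iter=10_000)
print(f"d = {d}, sigma = {sigma:.3f}, observed acceptance rate = {rate:.3f}")
```

In a pilot run one would increase `sigma` if the observed rate is well above the target and decrease it if the rate is well below, then fix `sigma` before the main run.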

The above results also describe the computational complexity of these algorithms.
Specifically, they say that as $d \to \infty$, the efficiency of RWM algorithms
scales like $d^{-1}$, so its computational complexity is $O(d)$. Similarly, the efficiency
of Langevin algorithms scales like $d^{-1/3}$, so its computational complexity is
$O(d^{1/3})$, which is of much lower order (i.e. better).

We note that for reasonable efficiency, we do not need the acceptance rate 
to be exactly 0.234 (or 0.574), just fairly close. Also, the dimension doesn't 
have to be too large before asymptotics approximately kick in; often 0.234 is 
approximately optimal in dimensions as low as 5 or 10. For further discussion 
of these issues, see the review article [74]. 

Now, the above results are only proved under the strong assumption (25).
It is natural to ask what happens if this assumption is not satisfied. In that
case, there are various extensions of the optimal-scaling results to cases of
inhomogeneously-scaled components of the form $\pi_u(x) = \prod_{i=1}^d C_i f(C_i x_i)$
(see [74]), to the discrete hypercube [65], and to finite-range homogeneous
Markov random fields [14]; in particular, the optimal acceptance rate remains
0.234 (under appropriate assumptions) in all of these cases. On the other hand,
surprising behaviour can result if we do not start in stationarity, i.e. if the
assumption $X_0 \sim \pi(\cdot)$ is violated and the chain instead begins way out in the tails
of $\pi(\cdot)$; see [18]. The true level of generality of these optimal scaling results is
currently unknown, though investigations are ongoing [10]. In general this is an
open problem:

Open Problem # 3. Determine the extent to which the above optimal scaling
results continue to apply, even when assumption (25) is violated.

APPENDIX: Proof of Lemma 17

Lemma 17 above states (Meyn and Tweedie [54], Theorem 5.5.7) that for an
aperiodic, $\phi$-irreducible Markov chain, all petite sets are small sets.
To prove this, we require a lemma related to aperiodicity:

Lemma 35. Consider an aperiodic Markov chain on a state space $X$, with
stationary distribution $\pi(\cdot)$. Let $\nu(\cdot)$ be any probability measure on $X$. Assume
that $\nu(\cdot) \ll \pi(\cdot)$, and that for all $x \in X$, there is $n = n(x) \in \mathbf{N}$ and $\delta =
\delta(x) > 0$ such that $P^n(x,\cdot) \ge \delta\,\nu(\cdot)$ (for example, this always holds if $\nu(\cdot)$ is
a minorisation measure for a small or petite set which is reachable from all
states). Let $T = \{n \ge 1;\ \exists\, \delta_n > 0 \text{ s.t. } \int \nu(dx)\, P^n(x,\cdot) \ge \delta_n\,\nu(\cdot)\}$, and assume
that $T$ is non-empty. Then there is $n_* \in \mathbf{N}$ with $T \supseteq \{n_*, n_*+1, n_*+2, \ldots\}$.

Proof. Since $P^{n(x)}(x,\cdot) \ge \delta(x)\,\nu(\cdot)$ for all $x \in X$, it follows that $T$ is
non-empty.

Now, if $n, m \in T$, then since

$$\int_{x \in X} \nu(dx)\, P^{n+m}(x,\cdot) = \int_{x \in X} \int_{y \in X} \nu(dx)\, P^n(x,dy)\, P^m(y,\cdot) \ge \int_{y \in X} \delta_n\, \nu(dy)\, P^m(y,\cdot) \ge \delta_n \delta_m\, \nu(\cdot),$$

we see that $T$ is additive, i.e. if $n, m \in T$ then $n + m \in T$.

We shall prove below that $\gcd(T) = 1$. It is then a standard and easy fact (e.g.
[13], p. 541; or [83], p. 77) that if $T$ is non-empty and additive, and $\gcd(T) = 1$,
then there is $n_* \in \mathbf{N}$ such that $T \supseteq \{n_*, n_*+1, n_*+2, \ldots\}$, as claimed.
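
The standard fact just quoted can be checked on small examples. The sketch below is only a finite illustration, not a proof: it builds the additive closure of a hypothetical generating set up to a chosen bound and locates the point from which every integer up to that bound is present. The generators $\{3, 5\}$ and $\{4, 6\}$, the bound, and the function names are our own illustrative choices.

```python
from functools import reduce
from math import gcd


def additive_closure(generators, bound):
    """Set of all n <= bound that are sums of one or more generators."""
    reachable = [False] * (bound + 1)
    for n in range(1, bound + 1):
        reachable[n] = any(
            n == s or (n > s and reachable[n - s]) for s in generators
        )
    return {n for n in range(1, bound + 1) if reachable[n]}


def tail_start(generators, bound):
    """Smallest n_* with every n in [n_*, bound] in the closure (None if bound is missing)."""
    closure = additive_closure(generators, bound)
    n_star = None
    for n in range(bound, 0, -1):
        if n not in closure:
            break
        n_star = n
    return n_star


# gcd = 1: the closure contains every integer from some point on (checked up to the bound).
print(reduce(gcd, [3, 5]), tail_start([3, 5], 60))

# gcd = 2: odd integers never appear, so no such tail can exist.
print(reduce(gcd, [4, 6]), sorted(additive_closure([4, 6], 20)))
```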

We now proceed to prove that gcd(T) = 1. Indeed, suppose to the contrary 
that gcd(T) = d > 1. We will derive a contradiction. 

For $1 \le i \le d$, let

$$X_i = \{x \in X;\ \exists\, \ell \in \mathbf{N} \text{ and } \delta > 0 \text{ s.t. } P^{\ell d - i}(x,\cdot) \ge \delta\,\nu(\cdot)\}.$$

Then $\bigcup_{i=1}^d X_i = X$ by assumption. Now, let

$$S = \bigcup_{i \ne j} (X_i \cap X_j),$$

let

$$\bar{S} = S \cup \{x \in X;\ \exists\, m \in \mathbf{N} \text{ s.t. } P^m(x, S) > 0\},$$

and let

$$X_i' = X_i \setminus \bar{S}.$$

Then $X_1', X_2', \ldots, X_d'$ are disjoint by construction (since we have removed $\bar{S}$).
Also if $x \in X_i'$, then $P(x, \bar{S}) = 0$, so that $P(x, \bigcup_{j=1}^d X_j') = 1$ by construction.
In fact we must have $P(x, X_{i+1}') = 1$ in the case $i < d$ (with $P(x, X_1') = 1$ for
$i = d$), for if not then $x$ would be in two different $X_j'$ at once, contradicting their
disjointness.

We claim that for all $m \ge 0$, $\nu P^m(X_i \cap X_j) = 0$ whenever $i \ne j$. Indeed, if we
had $\nu P^m(X_i \cap X_j) > 0$ for some $i \ne j$, then there would be $S' \subseteq X$ with
$\nu P^m(S') > 0$, $\ell_1, \ell_2 \in \mathbf{N}$, and $\delta > 0$ such that for all $x \in S'$,
$P^{\ell_1 d - i}(x,\cdot) \ge \delta\,\nu(\cdot)$ and $P^{\ell_2 d - j}(x,\cdot) \ge \delta\,\nu(\cdot)$,
implying that $\ell_1 d - i + m \in T$ and $\ell_2 d - j + m \in T$, contradicting the fact that
$\gcd(T) = d$.

It then follows (by sub-additivity of measures) that $\nu(\bar{S}) = 0$. Therefore,
$\nu(\bigcup_{i=1}^d X_i') = \nu(\bigcup_{i=1}^d X_i) = \nu(X) = 1$. Since $\nu \ll \pi$, we must have
$\pi(\bigcup_{i=1}^d X_i') > 0$.

We conclude from all of this that $X_1', \ldots, X_d'$ are subsets of positive
$\pi$-measure, with respect to which the Markov chain is periodic (of period $d$),
contradicting the assumption of aperiodicity. □

Proof of Lemma 17. Let $R$ be $(n_0, \epsilon, \nu(\cdot))$-petite, so that $\sum_{i=1}^{n_0} P^i(x,\cdot) \ge \epsilon\,\nu(\cdot)$
for all $x \in R$. Let $T$ be as in Lemma 35. Then $\sum_{i=1}^{n_0} \int_{x \in X} \nu(dx)\, P^i(x,\cdot) \ge \epsilon\,\nu(\cdot)$,
so we must have $i \in T$ for some $1 \le i \le n_0$, so that $T$ is non-empty. Hence,
from Lemma 35, we can find $n_*$ and $\delta_n > 0$ such that $\int \nu(dx)\, P^n(x,\cdot) \ge \delta_n\,\nu(\cdot)$
for all $n \ge n_*$. Let $r = \min\{\delta_n;\ n_* \le n \le n_* + n_0 - 1\}$, and set $N = n_* + n_0$.
Then for $x \in R$,

$$P^N(x,\cdot) \ \ge\ \cdots\ \ge\ \int_{y \in X} r\,\nu(dy)\,\epsilon\,\nu(\cdot) \ =\ r\epsilon\,\nu(\cdot).$$

Thus, $R$ is $(N, r\epsilon, \nu(\cdot))$-small. □